The Future of SOC

Over the last few years we have moved away from a SOC that used to be almost solely focused on Network and Windows events and artifacts (probably a strong fintech bias here) towards the Frankenstein’s monster we see today – very fractured, multi-dimensional, multi-platform, multi-architectural, multi-device and multi-everything-centric, plus certainly multi-regional (regulated markets, data across borders)/privacy-savvy, on- and off-prem, covering *aaS and endpoints/servers/mobile/virtualization/containers/CI/CD pipelines, and did I mention multi-cloud, public and private environments, vendor vs. proprietary, with a bonus of over-eager employees who keep sending ‘dangerous’ stuff to the SOC because they have been trained well to report <insert any suspicious event here>? And finally, one where NO ONE knows even the basics of all the existing, rapidly emerging, and increasingly confusing technologies anymore, let alone the gamut of ideas and solutions that help to address (or at least detect) many of these security problems.

I think we moved away from a fairly well-understood model that held between… let’s say 2000 and 2018 /COMFORTABLE/ towards one that is (as of today at least, 2018 onwards) full of unknown unknowns /VERY UNCOMFORTABLE/…

How do we deal with it today?

Usually a bridge, a Slack or Teams channel with 100-200 people on it.

I think divide and conquer is the only way to deal with it. Also, more work than ever before has to focus on building bridges with internal owners of technologies and with architects. This includes, for instance, a lot of DevSecOps work, shifting left, early involvement in application development and release cycles, security-oriented feature and LOG requests, a heavy red-team footprint on breaking it all, and, in contrast to the previous decade, lots of very hands-off work. Lots of commanding and coordination.

Borrowing a quote from dre: Blue Teaming is 90 percent social capital today.

Times have definitely changed…

And in parallel:
– stronger-than-ever reliance on vendors
– real (as in ‘old school’) cyber skills are in a strong decline — what took years to acquire and master is now gamified by vendor offerings that dumbify a lot of problems and requirements; I am not against it, because we need help, and while sometimes it comes in the form of b/s and extrapolations, we must admit that many non-technical analysts today, even without reading a single RFC in their life, can easily handle many incidents by just… talking and via vendor consoles – this would have been impossible 10 years ago
– seriously, the tools of today are fantastic: advanced sandboxes, threat intel portals, bug bounty portals, and the whole social media sharing scene make it far easier to find and share information that used to be available only to a few in the past
– the environments are getting more complicated — we need to work towards universal playbooks that cover heavily regulated regional markets
– portability is the key (work in one place -> work everywhere w/o many changes) — this affects multiple instances of systems of record, SOPs, detections, metrics (again, regional/regulated markets, plus the ability to quickly recover in case of a breach)
– Follow the Sun is now more complicated, as it includes Follow the Regulated Market
– from log deprivation to log over-saturation — time for some log governance, at source? common models for field naming? not only naming conventions, but also… and I really mean it… one, common, universal… TIMESTAMP FORMAT? (see the normalization sketch after this list)
– optimization efforts should be the norm — most detection engineering and threat hunting teams add to the workload; we need an opposite force that asks: hmm, is it really necessary? the same goes for a ruthless approach towards email fatigue — convert to tickets or kill at source, disable, decommission
– how many emails are your workflow and automation sending today? can you trim them down?
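To make the timestamp rant concrete, here is a minimal sketch (in Python) of the kind of at-source log governance I have in mind. The field names (ts, hostname, msg) and the list of vendor formats are hypothetical; the only point is that every event gets rewritten into one canonical shape (UTC, ISO 8601) and one common field model before it ever reaches the SIEM.

```python
# Minimal sketch: normalize assorted vendor timestamp formats into one
# canonical shape (UTC, ISO 8601) at the log-forwarding layer.
# Field names and formats below are hypothetical examples.
from datetime import datetime, timezone

KNOWN_FORMATS = [
    "%Y-%m-%dT%H:%M:%S%z",      # 2024-05-01T12:34:56+0000
    "%d/%b/%Y:%H:%M:%S %z",     # 01/May/2024:12:34:56 +0000 (web server style)
    "%b %d %H:%M:%S",           # May  1 12:34:56 (classic syslog, no year/zone)
    "%m/%d/%Y %I:%M:%S %p",     # 05/01/2024 12:34:56 PM
]

def normalize_timestamp(raw: str, assume_tz=timezone.utc) -> str:
    """Return the timestamp as UTC ISO 8601, or raise ValueError."""
    for fmt in KNOWN_FORMATS:
        try:
            dt = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
        if dt.tzinfo is None:                      # formats without a zone
            dt = dt.replace(tzinfo=assume_tz)
        if dt.year == 1900:                        # syslog format carries no year
            dt = dt.replace(year=datetime.now(tz=assume_tz).year)
        return dt.astimezone(timezone.utc).isoformat()
    raise ValueError(f"unrecognized timestamp: {raw!r}")

def normalize_event(event: dict) -> dict:
    """Rewrite a raw event into the common field model before shipping it."""
    return {
        "@timestamp": normalize_timestamp(event.get("ts", "")),
        "host": event.get("hostname") or event.get("src"),
        "message": event.get("msg", ""),
    }
```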

The above is what scares me. It’s currently hardly manageable, and it’s not sustainable. It’s whack-a-mole on steroids. We were meant to stop the whack-a-mole. And it not only happens now, it has intensified a lot in recent years, and imho this trend will continue. Just when we all started to really like the idea of having EDRs at our disposal… the *aaS happened, and there is no way back. Suddenly all our incident response playbooks and SOPs need to focus on a completely different type of threat. Lots of this work is actually more focused on proper access management than on infiltration by APT actors via well-known TTPs. A lot of work is also focused on shared responsibility — when you deal with alerts on-prem, on endpoints, it’s all nice and cozy, but when you are *aaS, the moment an external password spray hits a client application running on a server you host, one has to decide where the transfer of security responsibility occurs. Is it a threat to the hosting environment? The instance of the app? Both? It’s… complicated.

What is the SOC of the future?

I think there is no SOC in the future. There is a cross-organizational incident response committee (you don’t wanna know how much I hate this word!) that actively engages in tackling the issues at hand and ‘incident commands’ the respective teams, leading the issues to closure. Security becomes part of day-to-day operations. Representatives from many functions actually talk to each other, often, and the ‘old security’ in isolation is no longer a topic of any conversation. What is, though, is addressing the ‘are we affected?’ question on a VERY REGULAR BASIS. To help with that, advanced asset inventories covering hardware, software, *aaS, SBOMs and packages, all aiming at exposure assessment, potential containment and closing communication loops, are a MUST. It’s no longer, strictly speaking, a technical problem. It’s a problem that plays out on a stage, and that stage is not only political, but also visionary. Whoever puts in the effort to collect and maintain the best asset inventory, then predicts, plans to contain, and finally closes, will be the winner of the many brownie points to be distributed in this area in the future.
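To illustrate the ‘are we affected?’ loop, here is a minimal sketch of an exposure lookup against an asset inventory enriched with SBOM package data. The inventory layout, field names, and sample assets are all hypothetical; the point is that the inventory, not an analyst’s memory, answers the question and tells you whom to contact.

```python
# Minimal sketch: answer "are we affected?" from an asset inventory
# enriched with SBOM package data. The inventory layout, field names
# and sample assets are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Asset:
    name: str
    owner: str                       # team or person accountable for the asset
    internet_facing: bool
    packages: dict = field(default_factory=dict)   # package name -> version

INVENTORY = [
    Asset("payments-api", "team-payments", True,  {"openssl": "1.1.1k", "log4j-core": "2.14.1"}),
    Asset("build-runner", "team-platform", False, {"openssl": "3.0.13"}),
]

def are_we_affected(package: str, bad_versions: set[str]) -> list[Asset]:
    """Return assets running a known-bad version of the given package."""
    return [a for a in INVENTORY if a.packages.get(package) in bad_versions]

if __name__ == "__main__":
    hits = are_we_affected("log4j-core", {"2.14.0", "2.14.1", "2.15.0"})
    for asset in sorted(hits, key=lambda a: not a.internet_facing):
        # Internet-facing assets first: they drive containment priority.
        print(f"AFFECTED: {asset.name} (owner: {asset.owner}, "
              f"internet-facing: {asset.internet_facing})")
```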

And that’s why the future belongs to the TRIAGE function.

That Omelas child, the punchbag, the scapegoat. The first line of defense, and the most important. Yet so often neglected.

Clear SOPs for Triage will help to handle most of the incoming ‘requests’. You want that Triage team to be supported as hell. Their procedures must be simple, to the point, and with clear paths for both closure and escalation. Such a triage function will train the best IR practitioners of the future. Jacks of all trades, outspoken, cooperative, and assertive.

The game is changing and we need to adapt. It’s time you take your Triage team out for a good dinner.

Dealing with alert fatigue, Part 2

In the first part of this series I found myself jumping from one topic to another. I will do so in part 2, too 🙂

Dealing with alert fatigue requires a focused, multipronged approach:

  • streamline submissions of ‘possible incidents’ (reported via phone, IM, social media, OSINT feed, peer group, vendor escalation, email to a known security distro, email to a known ‘security’ person, email to a random person that maybe knows a ‘security’ person, submission via a random web form set up 10 years ago, and so on and so forth)
    • you want a single place of submission! not 20 different, disorganized places!
  • gather all incoming ‘workable’ items in one place (ticketing system, system of record, Bob’s mailbox) & help submitters as much as you can; note that these submitters, most of the time, don’t have a clue and simply need some reassurance
  • sort these incoming alerts out: classify them, prioritize them, own their closure and their respective SLAs; assign handling to different groups depending on the classification and prioritization (see the routing sketch after this list), f.ex.:
    • junior vs senior handler
    • triage vs. analysis/investigations team
    • time-sensitive vs. non-time sensitive
    • global vs. local impact
    • internal or external scope
    • customer-related or not
    • shared responsibility model or not
    • etc.
  • generate metrics to at least know what you are dealing with
    • you want to ensure that all global, follow-the-Sun parties involved contribute equally (no ticket cherry-picking or kicking the can to the next region; plus holidays and special occasions are taken care of and accounted for in the stats, etc.)
    • you want to ensure tickets are closed within certain SLAs; if you don’t have SLAs, define them
    • check how long it takes to close tickets, and their classes… it’s eye-opening; TALK TO ANALYSTS A LOT
    • you want to ensure regulated markets are covered & you have the resources to cover them
    • you can use these metrics to see what direction the next step should take; that means: people, process, technology improvements (metrics build a case for you to hire more people or train them, you can improve the processes, you can change/add/remove technology, you can also decommission some tickets that are low priority, etc.)
  • convert all the unstructured ticketing data into a kinda-structured form:
    • whatever the class of the ticket, it’s most likely that the information preserved in it is not structured; the ticket source is not populating designated fields in the ticket ‘database’, the data is not auto-enriched in any way, and the presentation layer probably sucks as well
    • you want to see it all, and to do so you extract metadata, including but not limited to: who submitted the ticket, where from (ip, device name, device type, account name, user name, owner, resource pool, etc.), why, and what the observables are (IOCs, URLs, email headers)… basically, extract anything that has any meaning whatsoever that could be used to compare and correlate it against the very same data from other tickets (see the extraction sketch after this list)
    • you can take snapshots from the last 24h, last week, month, year, etc.
    • you put this data in Excel, Splunk, whatever, and then you start analysing — you are looking for candidates for auto-closure!
    • you are also looking for items of interest that could be used as a ‘seed’ for further processing, research & pivots to speed up investigations: the aforementioned data enrichment can rely on artifacts you extract from the ticket metadata
  • you also want to check the FP ratio for every single class of tickets you have (see the FP-ratio sketch after this list)
    • if it has always been an FP, or always an FP in the last 3, 6, or 12 months, then why is it still a ticketing detection? Can it be converted to some other detection type? a dashboard? can it be eye-balled/triaged BEFORE becoming a ticket? are these detections time-sensitive, or can they be processed in a slower mode?
  • yes, you may need a second queue, one for ‘low-fidelity’ detections, the slow, vague stuff, the one you may never really process fully
    • non-time sensitive stuff
    • low-fidelity stuff; the ‘it looks interesting, but not enough to be an actionable triage/investigation item’ type of detections
    • dashboards fit in this space too
    • caveat: you need to measure time spent on it!
  • regular reviews of all the ticket classes are a must:
    • individual analysts won’t do it; they have a sharp focus on the ticket in front of them, but once it’s out of sight (closed), it’s out of their mind (apologies for the generalization; some of them WILL pick up on patterns, and this is the type of analyst you want to have on your team — they will help you beat the queue down!)
    • senior IR people are usually a better choice for analysis like this; they can look at last week’s tickets, listen to handovers, look for patterns, act on them
    • don’t be afraid to exclude – EDRs, AVs, Next Gen AV, proxy, IDS, IPS, WAF, all these logs are full of junk… exclude bad hits early; ideally, make it part of the SOP to add exclusions for every single FP from what is believed to be a ‘high-fidelity’ alert source — typically AV, EDR; you will be surprised how many tickets are NOT created once these exclusions are in place (simple targets may include a subset of devices included in Breach and Attack Simulation tests, phishing exercises, etc.)
    • research your environment… if you get network-based ‘high-fidelity’ alerts hitting on classes of vulnerabilities that do not apply to your environment — exclude them!
    • same goes for OLD stuff… if you get hits on a vuln from 2005 then it is an exclusion
    • every device exposed to the internet is being port scanned, pentested, and massaged by a gazillion good, bad, gray, or unknown sources; do not alert on these ‘just because’
    • a lot of activity worth analysing has moved to the endpoint level, even the browser level — alerts coming from this level are probably far more important than network-level ones (maybe with the exception of traffic volume monitoring? correct me here, please)
    • if you protect APIs, microservices, *aaS, Cloud, you need to understand the proprietary logs and/or cloud logs offered to you; it’s actually difficult to understand them, they are still in their infancy, and because there is often no public body of knowledge, you are on your own… so, if it is in scope, let your brightest engineers and analysts research that as a priority!
  • look at RBA (Risk-Based Alerting)
    • this has been a growing trend since at least 2018 (see the excellent presentation by Jim Apger and Stuart McIntosh that imho started it all at Splunk .conf18; pdf warning)
    • instead of alerting on everything, you cherry-pick events, score them, and calculate a score for the cluster within a certain timeframe, usually per host, per user account, or per the tuple of the two; then you look at the highest-scoring clusters that bubble up to the top (see the RBA sketch after this list)
    • it’s still far from perfect, but it aggregates low-fidelity events that occur in close temporal proximity and, as a result, lets them be viewed within a certain context
    • IMPACT is still hard to measure, at least in my experience, but I strongly believe THIS IS THE WAY FORWARD
  • look at dedicated tools solving specific classes of problems (tickets)
    • responding to phishing reports can be algorithmically handled by dedicated solutions like Cofense, or what Sublime Security appears to be working on (full disclosure: I don’t have any affiliations with them, it’s just two companies I know of that try to solve the problem in this space)
    • to solve other problems that are well-known across the industry and shared by many orgs, just buy whatever solves it, even if imperfectly – building your own solutions for such problems is tempting, but we are past this stage; ‘buy’ is the future for addressing ‘well-known problems’ — your vendor will solve it better than you, will have exposure to data from many clients, and will outdo you in many other ways
    • focus your ‘build’ efforts on handling tickets / incidents related to your internal, non-public events… be it API-related pre-auth and post-auth activity attempting to abuse your services, etc. – this is because no one is going to build it better than you
  • spend some time reviewing SOPs…
    • a good SOP, I mean one that clearly states what the closure and escalation criteria are – yes, a clearly written instruction helps you delegate a lot of basic work to junior analysts; mind you, junior analysts need instructions that are as clear as possible, and they need a reliable point of escalation; the result will surprise you – less time spent on tickets in general, faster involvement of more senior people, and much faster career progression and better morale among junior people — they will not hate the queue, they will become agents of change if they are empowered enough
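A few of the bullets above lend themselves to small code sketches. First, the sorting/classification bullet: a toy routing function that assigns an incoming alert to a handling queue based on a handful of attributes. The classes, fields, and queue names are invented for illustration.

```python
# Minimal sketch: route an incoming alert to a handling group based on a few
# attributes. Classes, field names and queue names are hypothetical.
from dataclasses import dataclass

@dataclass
class Alert:
    alert_class: str          # e.g. "phishing", "edr", "waf"
    severity: str             # "low" | "medium" | "high" | "critical"
    customer_facing: bool
    region: str               # e.g. "EU", "US", "APAC"
    shared_responsibility: bool

def route(alert: Alert) -> str:
    """Return the queue that should own this alert's closure and SLA."""
    if alert.shared_responsibility:
        return "cloud-shared-responsibility-queue"
    if alert.severity in ("high", "critical") or alert.customer_facing:
        return "senior-investigations-queue"     # time-sensitive, global impact
    if alert.region in ("EU",):                  # regulated-market handling
        return f"triage-{alert.region.lower()}-queue"
    return "triage-queue"                        # default: junior triage

print(route(Alert("phishing", "low", False, "US", False)))   # triage-queue
print(route(Alert("edr", "critical", True, "EU", False)))    # senior-investigations-queue
```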
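Second, the ‘unstructured to kinda-structured’ bullet: a minimal sketch that pulls a few common observable types out of free-text ticket bodies with regular expressions so they can be compared and correlated across tickets. The patterns are deliberately simple, not production-grade, and the sample tickets are made up.

```python
# Minimal sketch: extract rough observables (IPs, URLs, email addresses,
# hashes) from free-text ticket bodies so they can be correlated across
# tickets. Patterns are deliberately simple, not production-grade.
import re
from collections import Counter

PATTERNS = {
    "ipv4":   re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "url":    re.compile(r"\bhttps?://[^\s'\"<>]+", re.IGNORECASE),
    "email":  re.compile(r"\b[\w.+-]+@[\w.-]+\.\w{2,}\b"),
    "sha256": re.compile(r"\b[a-f0-9]{64}\b", re.IGNORECASE),
}

def extract_observables(ticket_text: str) -> dict:
    """Return {observable_type: [values]} found in a ticket body."""
    return {name: pat.findall(ticket_text) for name, pat in PATTERNS.items()}

def correlate(tickets: list) -> Counter:
    """Count how often each observable appears across all tickets."""
    seen = Counter()
    for body in tickets:
        for kind, values in extract_observables(body).items():
            seen.update((kind, v.lower()) for v in values)
    return seen

tickets = [
    "User reported mail from billing@example.com linking to http://evil.example/x",
    "EDR hit on 10.1.2.3, same URL seen: http://evil.example/x",
]
# Observables shared by multiple tickets are good pivot/enrichment seeds.
for (kind, value), count in correlate(tickets).most_common(3):
    print(kind, value, count)
```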
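Third, the FP-ratio check: a minimal sketch that computes, per ticket class, what fraction of closed tickets were false positives over a review window; classes above a threshold become candidates for exclusion, conversion to a dashboard, or the slow queue. The ticket structure and threshold are hypothetical.

```python
# Minimal sketch: FP ratio per ticket class over a review window.
# Ticket structure (class, disposition) is hypothetical.
from collections import defaultdict

closed_tickets = [
    {"class": "av-detection",    "disposition": "false_positive"},
    {"class": "av-detection",    "disposition": "false_positive"},
    {"class": "phishing-report", "disposition": "true_positive"},
    {"class": "phishing-report", "disposition": "false_positive"},
    {"class": "ids-old-vuln",    "disposition": "false_positive"},
]

def fp_ratio_by_class(tickets):
    """Return {ticket_class: (fp_ratio, total)} for the given window."""
    totals, fps = defaultdict(int), defaultdict(int)
    for t in tickets:
        totals[t["class"]] += 1
        if t["disposition"] == "false_positive":
            fps[t["class"]] += 1
    return {c: (fps[c] / totals[c], totals[c]) for c in totals}

REVIEW_THRESHOLD = 0.9   # classes that are ~always FP deserve a hard look

for cls, (ratio, total) in sorted(fp_ratio_by_class(closed_tickets).items(),
                                  key=lambda kv: -kv[1][0]):
    flag = "REVIEW: still worth a ticket?" if ratio >= REVIEW_THRESHOLD else ""
    print(f"{cls:16s} FP ratio {ratio:.0%} over {total} tickets {flag}")
```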
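Finally, a toy version of the RBA idea: individual low-fidelity events carry a small risk score, scores are summed per (host, user) tuple within a time window, and only clusters whose total crosses a threshold bubble up as an alert. The events, scores, and threshold are invented for illustration and are not taken from the .conf18 material.

```python
# Minimal sketch of Risk-Based Alerting: sum per-event risk scores per
# (host, user) tuple inside a time window and surface the top clusters.
# Event fields, scores and threshold are hypothetical.
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)
ALERT_THRESHOLD = 80

events = [
    {"time": datetime(2024, 5, 1, 9, 0),  "host": "wks-17", "user": "alice", "name": "office spawned powershell", "risk": 40},
    {"time": datetime(2024, 5, 1, 9, 5),  "host": "wks-17", "user": "alice", "name": "encoded command line",      "risk": 30},
    {"time": datetime(2024, 5, 1, 9, 20), "host": "wks-17", "user": "alice", "name": "new scheduled task",        "risk": 25},
    {"time": datetime(2024, 5, 1, 14, 0), "host": "srv-02", "user": "bob",   "name": "failed logons burst",       "risk": 20},
]

def rba_clusters(events, now):
    """Aggregate risk per (host, user) over the window; return sorted clusters."""
    clusters = defaultdict(lambda: {"score": 0, "events": []})
    for e in events:
        if now - e["time"] <= WINDOW:
            key = (e["host"], e["user"])
            clusters[key]["score"] += e["risk"]
            clusters[key]["events"].append(e["name"])
    return sorted(clusters.items(), key=lambda kv: -kv[1]["score"])

for (host, user), c in rba_clusters(events, datetime(2024, 5, 1, 23, 0)):
    status = "ALERT" if c["score"] >= ALERT_THRESHOLD else "watch"
    print(f"{status}: {host}/{user} score={c['score']} events={c['events']}")
```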

You may say it’s a nice list, full of vague statements and patronizing attempts to sound smart hidden under the umbrella of edification efforts, but hear me out a bit more…

Set up your first meeting with your senior engineers / analysts / incident commanders. Do metrics. Do them every week. What to do next after that will naturally emerge from these discussions. You can only change things that you actually look at and understand. Your queue maturity is what the Antoine de Saint-ExupĂ©ry clichĂ© quote speaks about: Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.

Kill your tickets. Now.
