Dealing with alert fatigue, Part 2

In the first part of this series I found myself jumping from one topic to another. I will do so in part 2, too 🙂

Dealing with alert fatigue requires a focused, multipronged approach:

  • streamline submissions of ‘possible incidents’ (reported via phone, IM, social media, OSINT feed, peer group, vendor escalation, email to a known security distro, email to a known ‘security’ person, email to a random person that maybe knows a ‘security’ person, submission via a random web form set up 10 years ago, and so on and so forth)
    • you want a single place of submission! not 20 different, disorganized places!
  • gather all incoming ‘workable’ items in one place (ticketing system, system of record, Bob’s mailbox) & help submitters as much as you can; note that these submitters, most of the time, don’t have a clue and simply need some reassurance
  • sort these incoming alerts out: classify them, prioritize them, own their closures and their respective SLAs; assign handling to different groups depending on the classification and prioritization, f.ex.:
    • junior vs senior handler
    • triage vs. analysis/investigations team
    • time-sensitive vs. non-time-sensitive
    • global vs. local impact
    • internal or external scope
    • customer-related or not
    • shared responsibility model or not
    • etc.
  • generate metrics to at least know what you are dealing with
    • you want to ensure that all global, follow-the-Sun parties involved contribute equally (no ticket cherry-picking, no kicking the can to the next region; holidays and special occasions are covered and accounted for in the stats, etc.)
    • you want to ensure tickets are closed within certain SLAs; if you don’t have SLAs, define them
    • check how long it takes to close tickets, and which classes they fall into… it’s eye-opening; TALK TO ANALYSTS A LOT
    • you want to ensure regulated markets are covered & that you have the resources to cover them
    • you can use these metrics to decide what direction the next step should take; that means people, process and technology improvements (metrics build a case for you to hire more people or train them, you can improve the processes, you can change/add/remove technology, you can also decommission some low-priority ticket classes, etc.); a rough metrics sketch follows this list
  • convert all the unstructured ticketing data to a kinda-structured one:
    • whatever the ticket class, the information preserved in the ticket is most likely not structured; the ticket source does not populate designated fields in the ticket ‘database’, data is not auto-enriched in any way, and the presentation layer probably sucks as well
    • you want to see it all, and to do so you extract metadata, including but not limited to: who submitted the ticket, where from (IP, device name, device type, account name, user name, owner, resource pool, etc.), why, and what the observables are (IOCs, URLs, email headers)… basically, extract anything that has any meaning whatsoever and could be used to compare and correlate it against the very same data from other tickets (see the extraction sketch after this list)
    • you can take snapshots from last 24h, last week, month, year, etc.
    • you put this data in Excel, Splunk, whatever, and then you start analysing — you are looking for candidates for auto-closures!
    • you are also looking for items of interest that could be used as a ‘seed’ to further processing, research & pivots to speed up investigations: aforementioned data enrichment can rely on artifacts you extract from the ticket metadata
  • you also want to check FP ratio for every single class of the tickets you have
    • if it has always been a FP, or always a FP in the last 3, 6, 12 months, then why is it still a ticketing detection? Can it be converted to some other detection type? A dashboard? Can it be eye-balled/triaged BEFORE becoming a ticket? Are these detections time-sensitive, or can they be processed in a slower mode?
  • yes, perhaps you may need a second queue, the one for ‘low-fidelity’ detections, the slow, vague stuff, the one you may never really process fully
    • non-time sensitive stuff
    • low-fidelity stuff; the ‘it looks interesting, but not enough to be an actionable triage/investigation item’ type of detections
    • dashboards fit in this space too
    • caveat: you need to measure time spent on it!
  • regular reviews of all the ticket classes are a must:
    • individual analysts won’t do it; they have a sharp focus on the ticket in front of them, but once out of sight (closed) it’s out of their mind (apologies for generalization, some of them WILL pick up some patterns, and this is the type of analyst you want to have on your team — they will help you to beat the queue down!)
    • senior IR people are usually a better choice for analysis like this; they can look at last week’s tickets, listen to handovers, look for patterns, act on them
    • don’t be afraid to exclude – EDRs, AVs, Next Gen AV, proxy, IDS, IPS, WAF, all these logs are full of junk… exclude bad hits early; ideally, make it part of the SOP to add an exclusion for every single FP from what is believed to be a ‘high-fidelity’ alert source — typically AV, EDR; you will be surprised how many tickets are NOT created once these exclusions are in place (simple targets may include a subset of devices included in Breach and Attack Simulation tests, phishing exercises, etc.; a simple suppression sketch follows this list)
    • research your environment… if you get network-based ‘high-fidelity’ alerts hitting on classes of vulnerabilities that do not apply to your environment — exclude them!
    • same goes for OLD stuff… if you get hits on a vuln from 2005 then it is an exclusion
    • every device exposed to the internet is being port scanned, pentested, and massaged by a gazillion good, bad, gray or unknown sources; do not alert on these ‘just because’
    • a lot of activity worth analyzing has moved to the endpoint level, even the browser level — alerts coming from this level are probably far more important than network-level ones (maybe with the exception of traffic volume monitoring? correct me here, please)
    • if you protect APIs, microservices, *aaS, Cloud, you need to understand the proprietary logs and/or cloud logs offered to you; it’s actually difficult to understand them, they are still in their infancy, and because there is often no public body of knowledge, you are on your own… so, if it is in scope, let your brightest engineers and analysts research that as a priority!
  • look at RBA (Risk Based Alerting)
    • this is a growing trend, since 2018 at least (see the excellent presentation by Jim Apger and Stuart McIntosh from Splunk .conf18 that imho started it all; pdf warning)
    • instead of alerting on everything, you cherry-pick events, score them, and calculate a score for the cluster within a certain timeframe, usually per host, per user account, or per the tuple of the two; then you look at the highest-scoring clusters that bubble up to the top (a toy scoring sketch follows this list)
    • it’s still far from perfect, but it aggregates low-fidelity events that execute in close temporal proximity into a single list, and as a result they are viewed within a certain context
    • IMPACT is still hard to measure, at least in my experience, but I strongly believe THIS IS THE WAY FORWARD
  • look at dedicated tools solving specific classes of problems (tickets)
    • responding to phishing reports can be algorithmically handled by dedicated solutions like Cofense, or what Sublime Security appears to be working on (full disclosure: I don’t have any affiliations with them, it’s just two companies I know of that try to solve the problem in this space)
    • to solve other problems that are well-known across the industry and shared by many orgs, just buy whatever solves it, even if imperfectly – building your own solutions for such problems is tempting, but we are past this stage; ‘buy’ is the future for addressing ‘well known problems’ — your vendor will solve it better than you, will have exposure to data from many clients, and will outdo you in many other ways
    • focus your ‘build’ efforts on handling tickets / incidents related to your internal, non-public events… be it API-related activity, pre-auth and post-auth attempts to abuse your services, etc. – this is because no one is going to build it better than you
  • spend some time reviewing SOPs…
    • a good SOP, I mean one that clearly states what the closure and escalation criteria are, helps you delegate a lot of basic work to junior analysts; mind you, junior analysts need instructions that are as clear as possible, and they need a reliable point of escalation; the result will surprise you – less time spent on tickets in general, faster involvement of more senior people, and much faster career progression and better morale among junior people — they will not hate the queue, they will become agents of change if they are empowered enough
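
To make the metrics bullets above a bit more concrete, here is a minimal sketch. It assumes a hypothetical CSV export of closed tickets with class, created, closed and disposition columns (your system of record will differ); it computes per-class volume, mean time to close and FP rate, and flags classes that have been 100% FP over a decent volume as candidates for auto-closure or exclusion:

```python
# Minimal ticket-metrics sketch. Assumes a hypothetical CSV export of
# closed tickets with columns: class, created, closed, disposition (FP/TP).
import csv
from collections import defaultdict
from datetime import datetime
from statistics import mean

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format of the export


def load(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def per_class_metrics(rows):
    buckets = defaultdict(list)
    for r in rows:
        buckets[r["class"]].append(r)

    report = []
    for cls, items in buckets.items():
        hours = [
            (datetime.strptime(i["closed"], FMT)
             - datetime.strptime(i["created"], FMT)).total_seconds() / 3600
            for i in items
        ]
        fp_rate = sum(i["disposition"] == "FP" for i in items) / len(items)
        report.append({
            "class": cls,
            "tickets": len(items),
            "mean_hours_to_close": round(mean(hours), 1),
            "fp_rate": round(fp_rate, 2),
            # 100% FP over a decent volume -> candidate for auto-closure,
            # an exclusion, or demotion to a dashboard / slow queue
            "auto_close_candidate": fp_rate == 1.0 and len(items) >= 20,
        })
    return sorted(report, key=lambda r: (-r["auto_close_candidate"], -r["tickets"]))


if __name__ == "__main__":
    for row in per_class_metrics(load("tickets_last_90d.csv")):
        print(row)
```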
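
And a rough sketch of the ‘convert unstructured tickets into kinda-structured data’ idea; the regexes are deliberately naive (no defanged IOCs, no email header parsing), they only illustrate turning free text into comparable, correlatable fields:

```python
# Naive observable extraction from free-text tickets. Real extraction
# (defanged IOCs, email headers, hostnames, etc.) needs far more care.
import re

PATTERNS = {
    "ipv4":   re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "url":    re.compile(r"https?://[^\s\"'<>]+"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "email":  re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def extract_observables(ticket_text: str) -> dict:
    """Return de-duplicated observables per type, ready to be stored as
    structured ticket fields and correlated against other tickets."""
    return {name: sorted(set(rx.findall(ticket_text))) for name, rx in PATTERNS.items()}


if __name__ == "__main__":
    sample = ("User jdoe@example.com reported a redirect to "
              "http://bad.example/p while browsing from 10.0.3.7")
    print(extract_observables(sample))
```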
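
The exclusion idea can be as small as a suppression check that runs before a ticket is ever created. Everything below (sources, signatures, host naming) is hypothetical and only illustrates the mechanics of SOP-driven exclusions for BAS fleets, ancient vulns and the like:

```python
# Pre-ticketing suppression sketch. The exclusion entries are hypothetical;
# in practice they come from the SOP-driven review of every confirmed FP.
from dataclasses import dataclass


@dataclass
class Alert:
    source: str      # e.g. "EDR", "AV", "IDS"
    signature: str   # detection / rule name
    host: str


# (source, signature substring, host prefix); "*" matches anything
EXCLUSIONS = [
    ("IDS", "CVE-2005-", "*"),   # ancient vuln, not applicable to the estate
    ("EDR", "*", "bas-"),        # Breach and Attack Simulation fleet
    ("AV",  "EICAR", "*"),       # test-file hits
]


def suppressed(alert: Alert) -> bool:
    for src, sig, host in EXCLUSIONS:
        if ((src == "*" or src == alert.source)
                and (sig == "*" or sig in alert.signature)
                and (host == "*" or alert.host.startswith(host))):
            return True
    return False


def to_ticket(alert: Alert):
    if suppressed(alert):
        return None  # counted for metrics, but no ticket is created
    return {"source": alert.source, "signature": alert.signature, "host": alert.host}


if __name__ == "__main__":
    print(to_ticket(Alert("EDR", "credential dumping", "bas-win10-07")))    # -> None
    print(to_ticket(Alert("EDR", "credential dumping", "fin-laptop-113")))  # -> ticket
```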
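
Finally, a toy sketch of the RBA idea: individual low-fidelity events carry a score, scores are aggregated per (host, user) within a time window, and only the clusters that cross a threshold bubble up. The scores, the window and the threshold are made up; real implementations (e.g. the Splunk RBA approach mentioned above) are far richer:

```python
# Toy Risk Based Alerting sketch: sum per-event scores per (host, user)
# within a time window and surface only the clusters above a threshold.
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)   # assumed aggregation window
THRESHOLD = 100                # assumed risk threshold


def risk_clusters(events):
    """events: iterable of dicts with keys ts (datetime), host, user, name, score."""
    clusters = defaultdict(list)
    for e in events:
        clusters[(e["host"], e["user"])].append(e)

    findings = []
    for (host, user), evts in clusters.items():
        evts.sort(key=lambda e: e["ts"])
        # naive sliding window over the time-sorted events
        for i, start in enumerate(evts):
            in_window = [e for e in evts[i:] if e["ts"] - start["ts"] <= WINDOW]
            total = sum(e["score"] for e in in_window)
            if total >= THRESHOLD:
                findings.append({
                    "host": host, "user": user, "score": total,
                    "events": [e["name"] for e in in_window],
                })
                break  # one finding per cluster is enough for this sketch
    return sorted(findings, key=lambda f: -f["score"])


if __name__ == "__main__":
    now = datetime.now()
    demo = [
        {"ts": now, "host": "host1", "user": "jdoe",
         "name": "rare parent-child process", "score": 40},
        {"ts": now + timedelta(minutes=5), "host": "host1", "user": "jdoe",
         "name": "new scheduled task", "score": 30},
        {"ts": now + timedelta(minutes=9), "host": "host1", "user": "jdoe",
         "name": "outbound to a rare domain", "score": 50},
    ]
    print(risk_clusters(demo))
```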

You may say it’s a nice list, full of vague statements and patronizing attempts to sound smart hidden under the umbrella of edification efforts, but hear me out a bit more…

Set up your first meeting with your senior engineers / analysts / incident commanders. Do metrics. Do them every week. What to do next after that will naturally emerge from these discussions. You can only change things that you actually look at, and understand. Your queue maturity is what that clichéd Antoine de Saint-Exupéry quote speaks about: Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.

Kill your tickets. Now.


Dealing with alert fatigue, Part 1

Gazillion tickets, gazillion emails a day. Business as usual for most SOCs…

It actually doesn’t matter how we got here (although I will cover some bits later on) – what matters is that we ARE here, and it literally sucks the life out of us, and each new dozen of emails/tickets coming in throughout the day makes us all die inside a little more…

This series will solve many problems for you. Okay, not really, but it will give you ideas and tools that you can use to make life easier for yourself and your team. And if you do need help, ping me & I will be happy to help.

Okay, so a quick background first: over the last 20 years I have worked many queues: localization bugs, analysis of malware samples from customers, analysis of sandbox samples to improve sandbox engine quality, as well as tons of IR emails and tickets that I had to work on one by one… I worked on-call, covered follow-the-Sun queue processing, and led teams doing so. In every org I worked for I made it a priority to reduce ‘the stupid’.

What does it mean in practice?

In all these cases I always tried to make life easier for everyone involved. There is an old sales saying… always be closing. For years I have adopted this line as a mantra for my queue work and tried my best to follow it, to make sure I am the most effective ‘closer’ w/o affecting the security posture of the org I worked for. As an analyst, I used macro tools and bookmarklets to fill in the ticketing systems with preset values, wrote some scripts that helped me do quick light forensics in a pre-EDR era, then, when things got a bit modernized, looked at SOAP/REST to move the ticket-closing function to CLI tools (someone smarter than me actually implemented it on our team), then looked very carefully at Threat Hunting and Detection Engineering efforts with an intention to tune them down, and in recent years embraced automation to find ways to auto-close as many tickets as possible.
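
To give a flavour of the ‘close tickets from the CLI’ bit, here is a minimal sketch against a hypothetical REST ticketing API; the endpoint, fields and auth are made up, so substitute whatever your system of record actually exposes:

```python
# Hypothetical 'close a ticket from the CLI' helper. The API URL, payload
# fields and auth scheme are made up for illustration only.
import argparse

import requests

API = "https://ticketing.example.internal/api/v2/tickets"


def close_ticket(ticket_id: str, disposition: str, note: str, token: str) -> None:
    resp = requests.post(
        f"{API}/{ticket_id}/close",
        json={"disposition": disposition, "note": note},
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    p = argparse.ArgumentParser(description="Close a ticket with a preset disposition")
    p.add_argument("ticket_id")
    p.add_argument("--disposition", default="false_positive")
    p.add_argument("--note", default="Closed via CLI helper")
    p.add_argument("--token", required=True)
    a = p.parse_args()
    close_ticket(a.ticket_id, a.disposition, a.note, a.token)
```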

Always having 2 goals in mind: fewer tickets, and fewer False Positives.

And as a manager I always tried to kill alerts and emails at the source. Disable, decommission, phase out, if possible. And if not, tune, adjust, add exclusions, or just reduce the audience. Not everyone needs to get emails about everything. At any point in time the focus is on the ‘big picture’: trend analysis, aiming to recognize patterns and adjust, and giving the people working for you the precious gift of… time.

Before you start thinking that I am claiming credit for other people’s work and introducing their ideas as mine, I want to humbly offer words of appreciation for every single person I ever worked with who had far better ideas than mine, often implemented a POC that was sometimes groundbreaking at an industry level, and in some cases made my jaw actually drop (‘why didn’t I think of it before!’). The intention of this series is not to self-aggrandize, but to encourage everyone reading it to go out there and simply try to make a difference…

This is the moment where I have to say that I have a bit of a problem with Detection Engineering and Threat Hunting. I am actually part of the problem, because as a hunter I have developed many imperfect detection rules, SPL queries, yara sigs, etc., and while I was obviously very proud of them and often assumed that the results of using them are easy to assess and kinda ‘obvious’ to work with, I was adding fuel to the fire. This is an arrogant, selfish and short-sighted approach that doesn’t translate well into a world of SOPs…

Why?

We need to clearly state that there is a distinctive difference between ‘test’ or ‘beta’ or literally ‘FP-prone’ detections we think are ‘good to eyeball or to share on Twitter/github’ vs. actual, production-ready, actionable detections for which we want to create tickets/incidents that we ask SOC to work on. Investigations cost. You do want to cherry-pick occurrences where this work needs to be done. If your detection query/threat hunting query has too many hits, it’s not a good query.

Let me step back a bit more: we need to clearly state that there is a distinctive difference between time-sensitive and non-time-sensitive work. A lot of DE/TH work is non-time-sensitive, while SOC/CERT groups work under strict SLAs. The less time-sensitive work for them, the better. How to facilitate that? Give them… fewer alerts.

I am jumping from one topic to another, but it’s a complex, loaded issue.

Let’s jump a bit more… ask yourself this: do you work all your incidents via email? Using your ticketing system (system of record)? Or both?

In 2022, you MUST work all your incidents from a ticketing system.

That means that ANY incident coming in via email, IM, phone call, twitter message, etc. MUST always be put in a system of record first. Any email channels that MAY carry incident reports need to be disabled, phased out, or redirected to the ticketing system one way or another. There should be literally NO incident-related email to handle, apart from some random, ad-hoc, one-off situations (f.ex. a CISO emailing another CISO, which cascades down to the SOC; but even these should find their place in a ticketing system)…

Let me reiterate: all the incident work should be done from a system of record/ticketing system, and analysts should have their mailboxes as clean as possible.
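
For illustration, a minimal sketch of funnelling ad-hoc email reports into the system of record instead of letting them rot in mailboxes; the ticketing endpoint and fields are hypothetical, and real intake would usually live in a mail gateway rule or a SOAR playbook rather than a script like this:

```python
# Sketch: drain a shared 'security' mailbox into the ticketing system.
# The ticketing API (URL, fields, auth) is hypothetical; mailbox handling
# is intentionally naive (no encoded headers, no attachments, no retries).
import email
import imaplib

import requests

TICKET_API = "https://ticketing.example.internal/api/v2/tickets"


def create_ticket(summary: str, reporter: str, raw_ref: str, token: str) -> None:
    requests.post(
        TICKET_API,
        json={"summary": summary, "reporter": reporter,
              "source": "email", "raw_ref": raw_ref},
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    ).raise_for_status()


def drain_security_mailbox(host: str, user: str, password: str, token: str) -> None:
    box = imaplib.IMAP4_SSL(host)
    box.login(user, password)
    box.select("INBOX")
    _, data = box.search(None, "UNSEEN")
    for num in data[0].split():
        _, msg_data = box.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(msg_data[0][1])
        create_ticket(
            summary=msg.get("Subject", "(no subject)"),
            reporter=msg.get("From", "unknown"),
            raw_ref=msg.get("Message-ID", ""),
            token=token,
        )
    box.logout()
```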

System of record brings order and accountability to the work we do.

Email hygiene removes the clutter from analysts’ inboxes and reduces the number of daily context switches.

Mind you, we are still talking alert fatigue REDUCTION. This actually means constant MONITORING aka TELEMETRY. You need to build some metrics around the work that is done on a DAILY basis. And if you are a manager, you need to spend time reviewing that work. This will help to improve SOPs, this will help to improve the technical writing/reporting skills of the analysts, and this will highlight the ‘stupid’, ‘unnecessary’ or ‘borderline illegal’ things people actually do (and they do; f.ex. an external SOC member noticed our org was port-scanned from a distinctive IP, so he launched his own port scan against that IP; yes, facepalm).

What do you know about the origin of all alerts you see on your queue?

Many of them are ‘legacy’. Someone, somewhere, sometime in the past decided that a particular class of tickets is a MUST on the queue. Was it a result of a contractual, regulatory, or compliance/audit need? Was this detection added 10 years ago and is no longer important? Was that a whimsical, ego-driven, ‘wide’ detection added by ‘the most technical primadonna on the team’?

ROI is always the key.

The quality of alerting differs. Some alerts are simply high-fidelity; some are low-fidelity, but a true positive would have a tremendous impact on the org. Lots of parameters to think of. But… from a pragmatic angle… if you receive alerts on your queue that produced nothing but FPs in the last year, you do need to start asking yourself: why are we looking at them AS ALERTS at all?

The alert fatigue is not a problem of one person. It is a problem of the whole team. If you work in a SOC or on a triage team, keep your eyes open, look for patterns, escalate the ‘stupid’.

And it may be reasonable to create two queues: one for urgent, time-sensitive, SLA-driven work & one for the ‘maybes’ that require additional knowledge and some poking around, are ‘wider in scope’, and are NON-TIME-SENSITIVE.

There is no way to optimize your queue w/o your active participation. We live at a time when holacracy seems to be gaining popularity, but the reality is — we still need TECHNICAL decision makers who can make a lot of alerts go away. This activity may sometimes be perceived as brutal, ruthless, and may hurt the feelings of some people (‘infosec rockstars’, ‘team primadonnas’), but ROI is out there to be your beacon and your excuse…

Cut it down, if you can. But fundamentally, start talking about it within your teams. The alert fatigue is a human-made problem. I guarantee you that you can cut down 50% of your alerts with a minimal penalty. How do I know that? I’ve done it before.
