Dealing with alert fatigue, Part 1

A gazillion tickets, a gazillion emails a day. Business as usual for most SOCs…

It actually doesn’t matter how we got here (although I will cover some bits later on) – what matters is that we ARE here, and it literally sucks the life out of us, and every new dozen emails/tickets arriving throughout the day makes us all die inside a little more…

This series will solve many problems for you. Okay, not really, but it will give you ideas and tools that you can use to make life easier for yourself and your team. And if you do need help, ping me & I will be happy to help.

Okay, so a quick background first: over the last 20 years I have worked many queues: localization bugs, analysis of malware samples from customers, analysis of sandbox samples to improve sandbox engine quality, as well as tons of IR emails and tickets that I had to work on one by one… I worked on call, covered follow-the-sun queue processing, and led teams doing so. In every org I worked for I made it a priority to reduce ‘the stupid’.

What does it mean in practice?

In all these cases I tried to make life easier for everyone involved. There is an old Sales saying: always be closing. For years I have used this line as a mantra for my queue work, and tried my best to be the most effective ‘closer’ w/o affecting the security posture of the org I worked for. As an analyst, I used macro tools and bookmarklets to fill in the ticketing systems with preset values, and wrote scripts that helped me do quick, light forensics in the pre-EDR era. Then, when things got a bit more modern, I looked at SOAP/REST to move the ticket closing function to CLI tools (someone smarter than me actually implemented it on our team). Later I looked very carefully at Threat Hunting and Detection Engineering efforts with the intention of tuning them down, and in recent years I embraced automation to find ways to auto-close as many tickets as possible.

Always with 2 goals in mind: fewer tickets, and fewer False Positives.

And as a manager I always tried to kill alerts and emails at the source. Disable, decommission, phase out, if possible. And if not, tune, adjust, add exclusions, or just reduce the audience. Not everyone needs to get emails about everything. At any point in time the focus is on the ‘big picture’: trend analysis, aiming to recognize patterns and adjust, and giving the people working for you the precious gift of… time.

Before you start thinking that I am claiming credit for other people’s work and presenting their ideas as mine, I want to humbly offer words of appreciation to every single person I ever worked with who had far better ideas than mine, often implemented a POC that was sometimes industry-level groundbreaking, and in some cases made my jaw actually drop (‘why didn’t I think of it before!’). The intention of this series is not to self-aggrandize, but to encourage everyone reading it to go out there and simply try to make a difference…

This is the moment where I have to say that I have a bit of a problem with Detection Engineering and Threat Hunting. I am actually part of the problem: as a hunter I have developed many imperfect detection rules, SPL queries, yara sigs, etc. I was obviously very proud of them, and often assumed that their results were easy to assess and kinda ‘obvious’ to work with, but all the while I was adding fuel to the fire. This is an arrogant, selfish and short-sighted approach that doesn’t translate well into a world of SOPs…

Why?

We need to clearly state that there is a distinct difference between ‘test’ or ‘beta’ or literally ‘FP-prone’ detections we think are ‘good to eyeball or to share on Twitter/github’ vs. actual, production-ready, actionable detections for which we want to create tickets/incidents that we ask the SOC to work on. Investigations cost. You do want to cherry-pick the occurrences where this work needs to be done. If your detection query/threat hunting query has too many hits, it’s not a good query.

Let me step back a bit more: we need to clearly state that there is a distinct difference between time-sensitive and non-time-sensitive work. A lot of DE/TH work is non-time-sensitive, while SOC/CERT groups work under strict SLAs. The less of that time-sensitive work lands on them, the better. How to facilitate that? Give them… fewer alerts.

I am jumping from one topic to another, but it’s a complex, loaded issue.

Let’s jump a bit more… ask yourself this: do you work all your incidents via email? Using your ticketing system (system of record)? Or both?

In 2022, you MUST work all your incidents from a ticketing system.

That means that ANY incident coming in via email, IM, phone call, Twitter message, etc. MUST always be put in a system of record first. Any email flows that MAY be incident-related need to be disabled, phased out, and redirected to the ticketing system one way or another. There should be literally NO incident-related email to handle, apart from some random, ad-hoc, one-off situations (f.ex. a CISO emailing another CISO, which cascades down to the SOC, but even these should find their place in a ticketing system)…
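To make the ‘email goes into the system of record first’ idea a bit more tangible, here is a minimal sketch that turns a raw email into a ticket-shaped record using only the Python standard library. The field names (`source`, `reporter`, `summary`, `details`) are made up for illustration; a real ticketing system will have its own schema and its own API for creating the ticket, neither of which is shown here.

```python
from email import message_from_string, policy
from email.utils import parseaddr


def email_to_ticket(raw_email):
    """Convert a raw RFC 5322 email into a dict shaped like a ticket.

    The dict keys are hypothetical; map them to whatever your
    system of record actually expects.
    """
    msg = message_from_string(raw_email, policy=policy.default)

    # Keep only the address part of "Name <address>".
    _, reporter = parseaddr(str(msg.get("From", "")))

    # Prefer the plain-text part; fall back to an empty body.
    body_part = msg.get_body(preferencelist=("plain",))
    details = body_part.get_content().strip() if body_part is not None else ""

    return {
        "source": "email",
        "reporter": reporter,
        "summary": str(msg.get("Subject", "(no subject)")),
        "details": details,
    }


raw = (
    "From: Alice <alice@example.com>\n"
    "Subject: Suspicious login\n"
    "\n"
    "Saw a login from an odd IP.\n"
)
ticket = email_to_ticket(raw)
```

Once the record exists, the email itself can be archived; the analyst works the ticket, not the mailbox.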

Let me reiterate: all the incident work should be done from a system of record/ticketing system, and analysts should have their mailboxes as clean as possible.

A system of record brings order and accountability to the work we do.

Email hygiene removes the clutter from analysts’ inboxes and reduces the number of daily context switches.

Mind you, we are still talking alert fatigue REDUCTION. This actually means constant MONITORING aka TELEMETRY. You need to build some metrics around the work that is done on a DAILY basis. And if you are a manager, you need to spend time reviewing that work. This will help to improve SOPs, this will help to improve the technical writing/reporting skills of the analysts, and this will highlight the ‘stupid’ or ‘unnecessary’ or ‘borderline illegal’ things people actually do (and they do, f.ex.: an external SOC member noticed our org was port-scanned from a distinctive IP; he launched his own port scan against that IP; yes, facepalm).
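What could such daily metrics look like? A minimal sketch, assuming you can export closed tickets with a close date, the rule that fired, a TP/FP verdict and the time spent. The `ClosedTicket` fields and the verdict labels are my assumptions for the example, not something any particular ticketing system gives you out of the box.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ClosedTicket:
    closed_on: date
    rule: str           # hypothetical name of the detection that fired
    verdict: str        # "TP" or "FP" (hypothetical labels)
    minutes_spent: int  # analyst time booked against the ticket


def daily_metrics(tickets):
    """Per-day totals: tickets closed, FP count, FP ratio, analyst minutes."""
    by_day = {}
    for t in tickets:
        day = by_day.setdefault(t.closed_on, {"closed": 0, "fp": 0, "minutes": 0})
        day["closed"] += 1
        if t.verdict == "FP":
            day["fp"] += 1
        day["minutes"] += t.minutes_spent
    for stats in by_day.values():
        stats["fp_ratio"] = stats["fp"] / stats["closed"]
    # Sort by date so trends are easy to read top to bottom.
    return dict(sorted(by_day.items()))


sample = [
    ClosedTicket(date(2022, 1, 4), "odd_parent_process", "TP", 45),
    ClosedTicket(date(2022, 1, 3), "av_quarantine_email", "FP", 5),
    ClosedTicket(date(2022, 1, 3), "av_quarantine_email", "FP", 5),
    ClosedTicket(date(2022, 1, 4), "av_quarantine_email", "FP", 5),
]
metrics = daily_metrics(sample)
```

Even something this crude, reviewed every day, shows where the time goes and which days were spent mostly closing FPs.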

What do you know about the origin of all alerts you see on your queue?

Many of them are ‘legacy’. Someone, somewhere, sometime in the past decided that a particular class of tickets is a MUST on the queue. Was it the result of a contractual, regulatory, or compliance/audit need? Was this detection added 10 years ago and is no longer important? Was it a whimsical, ego-driven, ‘wide’ detection added by ‘the most technical primadonna on the team’?

ROI is always the key.

The quality of alerting differs. Some alerts are simply high-fidelity; some are low-fidelity, but a single true hit would have a tremendous impact on the org. Lots of parameters to think of. But… from a pragmatic angle… if you receive alerts on your queue that produced nothing but FPs over the last year, you do need to start asking yourself: why are we looking at them AS ALERTS at all?
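Finding those ‘nothing but FPs’ alerts does not require anything fancy. A rough sketch, assuming you can export closed tickets as `(closed_on, rule, verdict)` tuples; the tuple layout, the "TP"/"FP" labels and the `min_tickets` threshold are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta


def fp_only_rules(closed, today, window_days=365, min_tickets=20):
    """Return {rule: ticket_count} for rules with zero TPs in the window.

    closed      -- iterable of (closed_on: date, rule: str, verdict: str)
    min_tickets -- ignore rules that barely fire; they are not the fatigue
    """
    cutoff = today - timedelta(days=window_days)
    totals = defaultdict(int)
    true_positives = defaultdict(int)
    for closed_on, rule, verdict in closed:
        if closed_on < cutoff:
            continue  # older than the review window
        totals[rule] += 1
        if verdict == "TP":
            true_positives[rule] += 1
    return {
        rule: count
        for rule, count in totals.items()
        if count >= min_tickets and true_positives.get(rule, 0) == 0
    }


today = date(2022, 6, 1)
closed = (
    [(date(2022, 1, 1) + timedelta(days=i), "noisy_rule", "FP") for i in range(25)]
    + [(date(2022, 2, 1), "good_rule", "TP")]
    + [(date(2022, 2, 2), "good_rule", "FP")] * 30
    + [(date(2020, 1, 1), "old_rule", "FP")] * 50
)
candidates = fp_only_rules(closed, today)
```

The output is a list of candidates for a conversation, not an automatic kill list: a low-fidelity rule guarding something with tremendous impact may still deserve its place, just maybe not as a ticket on the urgent queue.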

Alert fatigue is not a problem of one person. It is a problem of the whole team. If you work in a SOC or on a triage team, keep your eyes open, look for patterns, escalate the ‘stupid’.

And it may be reasonable to create two queues: one for urgent, time-sensitive, SLA-driven work, and one for the ‘maybes’, i.e. work that requires additional knowledge and some poking around, is ‘wider in scope’, and is NON-TIME-SENSITIVE.
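The two-queue split can be expressed as a tiny routing rule. This is only a sketch: the `Alert` fields, the queue names, and the decision that high-fidelity alerts go to the urgent queue are my assumptions, and your own criteria will likely differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    rule: str
    fidelity: str    # "high" or "low" (hypothetical labels)
    sla_bound: bool  # True when a contract or regulation sets a deadline


def route(alert):
    """Pick a queue: 'urgent' for SLA-driven work, 'research' for the maybes."""
    if alert.sla_bound or alert.fidelity == "high":
        return "urgent"
    # Everything else is non-time-sensitive and can wait for a hunter.
    return "research"


queues = {"urgent": [], "research": []}
for alert in [
    Alert("ransomware_canary_touched", "high", True),
    Alert("rare_user_agent", "low", False),
    Alert("regulatory_dlp_hit", "low", True),
]:
    queues[route(alert)].append(alert.rule)
```

The point of writing it down, even this naively, is that the criteria become explicit and reviewable instead of living in someone’s head.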

There is no way to optimize your queue w/o your active participation. We live at a time when holacracy seems to be gaining popularity, but the reality is that we still need TECHNICAL decision makers who can make a lot of alerts go away. This activity may sometimes be perceived as brutal and ruthless, and may hurt the feelings of some people (‘infosec rockstars’, ‘team primadonnas’), but ROI is out there to be your beacon and an excuse…

Cut it down, if you can. But fundamentally, start talking about it within your teams. The alert fatigue is a human-made problem. I guarantee you that you can cut down 50% of your alerts with a minimal penalty. How do I know that? I’ve done it before.

Hexacorn
