Dealing with alert fatigue, Part 2

October 8, 2022 in SOC

In the first part of this series I found myself jumping from one topic to another. I will do so in part 2, too 🙂

Dealing with alert fatigue requires a focused, multipronged approach:

  • streamline submissions of ‘possible incidents’ (reported via phone, IM, social media, OSINT feed, peer group, vendor escalation, email to a known security distro, email to a known ‘security’ person, email to a random person that maybe knows a ‘security’ person, submission via a random web form set up 10 years ago, and so on and so forth)
    • you want a single place of submission! not 20 different, disorganized places!
  • gather all incoming ‘workable’ items in one place (ticketing system, system of record, Bob’s mailbox) & help submitters as much as you can; note that these submitters, most of the time, don’t have a clue and simply need some reassurance
  • sort these incoming alerts out: classify them, prioritize them, own their closures and their respective SLAs; assign handling to different groups depending on the classification and prioritization, f.ex.:
    • junior vs senior handler
    • triage vs. analysis/investigations team
    • time-sensitive vs. non-time sensitive
    • global vs. local impact
    • internal or external scope
    • customer-related or not
    • shared responsibility model or not
    • etc.
  • generate metrics to at least know what you are dealing with
    • you want to ensure that all global, follow-the-Sun parties involved contribute equally (no ticket cherry-picking, no kicking the can to the next region; holidays and special occasions are covered and accounted for in the stats, etc.)
    • you want to ensure tickets are closed within certain SLAs; if you don’t have SLAs, define them
    • check how long it takes to close tickets, and their classes… it’s eye-opening; TALK TO ANALYSTS A LOT
    • you want to ensure regulated markets are covered & you have resources to cover them
    • you can use these metrics to see what direction the next step should take; that means: people, process, technology improvements (metrics build a case for you to hire more people or train them, you can improve the processes, you can change/add/remove technology, you can also decommission some tickets that are low priority, etc.)
  • convert all the unstructured ticketing data to a kinda-structured one:
    • whatever the class of the ticket, the information preserved in it is most likely not structured; the ticket source is not populating designated fields in the ticket ‘database’, data is not auto-enriched in any way, and the presentation layer probably sucks as well
    • you want to see it all, and to do so you extract metadata, including but not limited to: who submitted the ticket, where from (IP, device name, device type, account name, user name, owner, resource pool, etc.), why, what the observables are, IOCs, URLs, email headers; basically… extract anything that has any meaning whatsoever and could be compared or correlated against the very same data from other tickets
    • you can take snapshots from the last 24h, last week, month, year, etc.
    • you put this data in Excel, Splunk, whatever, and then you start analysing: you are looking for candidates for auto-closures!
    • you are also looking for items of interest that could be used as a ‘seed’ for further processing, research & pivots to speed up investigations: the aforementioned data enrichment can rely on artifacts you extract from the ticket metadata
  • you also want to check the FP ratio for every single class of tickets you have
    • if it has always been an FP, or always an FP in the last 3, 6, 12 months, then why is it still a ticketing detection? Can it be converted to some other detection type? A dashboard? Can it be eye-balled/triaged BEFORE becoming a ticket? Are these detections time-sensitive, or can they be processed in a slower mode?
  • yes, perhaps you may need a second queue, the one for ‘low-fidelity’ detections, the slow, vague stuff, the one you may never really process fully
    • non-time sensitive stuff
    • low-fidelity stuff; the ‘it looks interesting, but not enough to be an actionable triage/investigation item’ type of detections
    • dashboards fit in this space too
    • caveat: you need to measure time spent on it!
  • regular reviews of all the ticket classes are a must:
    • individual analysts won’t do it; they have a sharp focus on the ticket in front of them, but once out of sight (closed) it’s out of their mind (apologies for generalization, some of them WILL pick up some patterns, and this is the type of analyst you want to have on your team — they will help you to beat the queue down!)
    • senior IR people are usually a better choice for analysis like this; they can look at last week’s tickets, listen to handovers, look for patterns, act on them
    • don’t be afraid to exclude – EDRs, AVs, Next Gen AV, proxy, IDS, IPS, WAF, all these logs are full of junk… exclude bad hits early, ideally, make it a part of SOP to add exclusions for every single FP from what is believed to be a ‘high-fidelity’ alert source — typically AV, EDR; you will be surprised how many tickets are NOT created once these exclusions are in place (simple targets may include a subset of devices included in Breach and Attack Simulation tests, phishing exercises, etc.)
    • research your environment… if you get network-based ‘high-fidelity’ alerts hitting on classes of vulnerabilities that do not apply to your environment — exclude them!
    • same goes for OLD stuff… if you get hits on a vuln from 2005 then it is an exclusion
    • every device exposed to the internet is being port scanned, pentested, and massaged by a gazillion good, bad, gray or unknown sources; do not alert on these ‘just because’
    • a lot of activity worthy of analysis has moved to the endpoint level, even the browser level; alerts coming from this level are probably far more important than network-level ones (maybe with the exception of traffic volume monitoring? correct me here, please)
    • if you protect APIs, microservices, *aaS, Cloud, you need to understand the proprietary logs and/or cloud logs offered to you; it’s actually difficult to understand them, they are still in their infancy, and because there is often no public body of knowledge, you are on your own… so, if it is in scope, let your brightest engineers and analysts research that as a priority!
  • look at RBA (Risk Based Alerting)
    • this is a growing trend, since 2018 at least (see an excellent presentation by Jim Apger and Stuart McIntosh that imho started it all during Splunk .conf18; pdf warning)
    • instead of alerting on everything, you cherry-pick events, score them, calculate a score for the cluster within a certain timeframe, usually per host, per user account, or per the tuple of the two, then you look at the highest-scoring clusters that bubble up to the top
    • it’s still far from being perfect, but it aggregates low-fidelity events into a list of events that are executed within a close temporal proximity, and as a result, are viewed within a certain context
    • IMPACT is still hard to measure, at least in my experience, but I strongly believe THIS IS THE WAY FORWARD
  • look at dedicated tools solving specific classes of problems (tickets)
    • responding to phishing reports can be algorithmically handled by dedicated solutions like Cofense, or what Sublime Security appears to be working on (full disclosure: I don’t have any affiliations with them, these are just two companies I know of that try to solve the problem in this space)
    • to solve other problems that are well-known across the industry and shared by many orgs, just buy whatever solves it, even if imperfectly – building your own solutions for such problems is tempting, but we are past this stage; ‘buy’ is the future for addressing ‘well known problems’: your vendor will solve it better than you, will have exposure to data from many clients and will outdo you in many other ways
    • focus your ‘build’ efforts on handling tickets / incidents related to your internal, non-public events… let it be API-related, pre-auth and post-auth activities attempting to abuse your services, etc. – this is because no one is going to build it better than you
  • spend some time reviewing SOPs…
    • a good SOP, I mean one that clearly states what the closure and escalation criteria are, helps you delegate a lot of basic work to junior analysts; mind you, junior analysts need instructions that are as clear as possible, and they need a reliable point of escalation; the result will surprise you: less time spent on tickets in general, faster involvement of more senior people, and much faster career progression and better morale among junior people; they will not hate the queue, they will become agents of change if they are empowered enough
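
To make the RBA bullet above a bit more concrete, here is a minimal Python sketch of the scoring idea: score cherry-picked events, sum the scores per (host, user) tuple within a timeframe, and surface only the clusters that bubble up to the top. Everything specific in it is an assumption made up for illustration (the detection names, the scores, the 24h window, the threshold, the event field names); it is not how Splunk or any other product implements RBA.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical per-detection risk scores; real values come from your own tuning.
RISK_SCORES = {
    "office_spawns_shell": 40,
    "new_scheduled_task": 20,
    "rare_outbound_domain": 15,
}
WINDOW = timedelta(hours=24)  # assumed scoring timeframe
THRESHOLD = 60                # assumed score at which a cluster becomes a ticket


def score_clusters(events, now):
    """Sum risk scores per (host, user) tuple for events inside the window."""
    totals = defaultdict(int)
    for event in events:
        if now - event["time"] > WINDOW:
            continue  # too old to count toward the current cluster
        totals[(event["host"], event["user"])] += RISK_SCORES.get(event["detection"], 0)
    # Highest-scoring clusters come first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def clusters_to_ticket(events, now):
    """Keep only the clusters whose total score reaches the threshold."""
    return [(key, score) for key, score in score_clusters(events, now) if score >= THRESHOLD]


# Made-up events: two recent hits on one host, one recent and one stale hit on another.
now = datetime(2022, 10, 8, 12, 0)
events = [
    {"time": now - timedelta(hours=1), "host": "wks-01", "user": "alice", "detection": "office_spawns_shell"},
    {"time": now - timedelta(hours=2), "host": "wks-01", "user": "alice", "detection": "new_scheduled_task"},
    {"time": now - timedelta(hours=3), "host": "wks-02", "user": "bob", "detection": "rare_outbound_domain"},
    {"time": now - timedelta(days=3), "host": "wks-02", "user": "bob", "detection": "office_spawns_shell"},
]
print(clusters_to_ticket(events, now))
```

With these made-up numbers only the wks-01/alice cluster reaches the threshold; the single low-fidelity hit on wks-02 never becomes a ticket, which is the whole point.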

You may say it’s a nice list, full of vague statements, and patronizing attempts to sound smart hidden under the umbrella of edification efforts, but hear me out a bit more…

Set up your first meeting with your senior engineers / analysts / incident commanders. Do metrics. Do them every week. What to do next after that… it will naturally emerge from these discussions. You can only change things that you actually look at, and understand. Your queue maturity is what the Antoine de Saint-Exupéry cliché quote speaks about: Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.
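
If you don’t know where to start with the weekly metrics, a rough sketch like the one below may be enough for the first meeting: per ticket class it computes the ticket count, the median time to close, the number of SLA breaches and the FP ratio. The field names (`class`, `opened`, `closed`, `verdict`) and the SLA values are hypothetical; map them to whatever your ticketing system actually exports.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median

# Hypothetical SLA per ticket class; define your own if none exist yet.
SLA = {"phishing": timedelta(hours=4), "av_detection": timedelta(hours=8)}


def weekly_metrics(tickets):
    """Per-class ticket count, median hours to close, SLA breaches and FP ratio."""
    by_class = defaultdict(list)
    for ticket in tickets:
        by_class[ticket["class"]].append(ticket)

    report = {}
    for ticket_class, group in by_class.items():
        durations = [t["closed"] - t["opened"] for t in group]
        sla = SLA.get(ticket_class)  # None means no SLA defined for this class
        report[ticket_class] = {
            "count": len(group),
            "median_hours_to_close": median(d.total_seconds() for d in durations) / 3600,
            "sla_breaches": sum(1 for d in durations if sla is not None and d > sla),
            "fp_ratio": sum(1 for t in group if t["verdict"] == "FP") / len(group),
        }
    return report


# Made-up tickets closed during one week.
opened = datetime(2022, 10, 3, 9, 0)
tickets = [
    {"class": "phishing", "opened": opened, "closed": opened + timedelta(hours=2), "verdict": "FP"},
    {"class": "phishing", "opened": opened, "closed": opened + timedelta(hours=6), "verdict": "TP"},
    {"class": "av_detection", "opened": opened, "closed": opened + timedelta(hours=1), "verdict": "FP"},
]
report = weekly_metrics(tickets)
for ticket_class, stats in report.items():
    print(ticket_class, stats)
```

A class whose FP ratio sits at 1.0 week after week is an obvious candidate for the exclusion, auto-closure or ‘second queue’ discussions from the list above.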

Kill your tickets. Now.
