Dealing with alert fatigue, Part 1

Gazillions of tickets, gazillions of emails a day. Business as usual for most SOCs…

It actually doesn’t matter how we got here (although I will cover some bits later on) – what matters is that we ARE here, it sucks the life out of us, and every new dozen of emails/tickets coming in during the day makes us all die inside a little more…

This series will solve many problems for you. Okay, not really, but it will give you ideas and tools that you can use to make life easier for yourself and your team. And if you do need help, ping me & I will be happy to help.

Okay, so a quick background first: over the last 20 years I have worked many queues: localization bugs, analysis of malware samples from customers, analysis of sandbox samples to improve sandbox engine quality, as well as tons of IR emails and tickets that I had to work on one by one… I worked on-call, covered follow-the-sun queue processing, and led teams doing so. In every org I worked for I made it a priority to reduce ‘the stupid’.

What does it mean in practice?

In all these cases I always tried to make life easier for everyone involved. There is an old Sales saying… always be closing. For years I’ve been adapting this line to my queue work like a mantra, trying my best to be the most effective ‘closer’ w/o affecting the security posture of the org I worked for. As an analyst, I used macro tools and bookmarklets to fill in ticketing systems with preset values, and wrote scripts that helped me do quick, light forensics in a pre-EDR era. When things got a bit more modern, I looked at SOAP/REST to move the ticket-closing function to CLI tools (someone smarter than me actually implemented it on our team), then looked very carefully at Threat Hunting and Detection Engineering efforts with an intention to tune them down, and in recent years embraced automation to find ways to auto-close as many tickets as possible.

Always with 2 goals in mind: fewer tickets, and fewer False Positives.

And as a manager I always tried to kill alerts and emails at the source. Disable, decommission, phase out, if possible. And if not, tune, adjust, add exclusions, or just reduce the audience. Not everyone needs to get emails about everything. At any point in time the focus is on the ‘big picture’ and trend analysis, aiming to recognize patterns, adjust, and give the people working for you the precious gift of… time.

Before you start thinking that I am claiming credit for other people’s work and introducing their ideas as mine, I want to humbly offer words of appreciation to every single person I have ever worked with who had far better ideas than mine, often implemented POCs that were sometimes industry-level groundbreaking, and in some cases made my jaw actually drop (‘why didn’t I think of it before!’). The intention of this series is not to self-aggrandize, but to encourage everyone reading it to go out there and simply try to make a difference…

This is the moment where I have to say that I have a bit of a problem with Detection Engineering and Threat Hunting. I am actually part of the problem: as a hunter I have developed many imperfect detection rules, SPL queries, yara sigs, etc., and while being obviously very proud of them, I often assumed that their results are easy to assess and kinda ‘obvious’ to work with. I was adding fuel to the fire. This is an arrogant, selfish and short-sighted approach that doesn’t translate well into a world of SOPs…

Why?

We need to clearly state that there is a distinct difference between ‘test’, ‘beta’, or literally ‘FP-prone’ detections we think are ‘good to eyeball or to share on Twitter/github’ vs. actual, production-ready, actionable detections for which we want to create tickets/incidents that we ask the SOC to work on. Investigations cost. You do want to cherry-pick the occurrences where this work needs to be done. If your detection query/threat hunting query has too many hits, it’s not a good query.

Let me step back a bit more: we need to clearly state that there is a distinct difference between time-sensitive and non-time-sensitive work. A lot of DE/TH work is non-time-sensitive, while SOC/CERT groups work under strict SLAs. The less time-sensitive work for them, the better. How to facilitate that? Give them… fewer alerts.

I am jumping from one topic to another, but it’s a complex, loaded issue.

Let’s jump a bit more… ask yourself this: do you work all your incidents via email? Using your ticketing system (system of record)? Or both?

In 2022, you MUST work all your incidents from a ticketing system.

That means that ANY incident coming in via email, IM, phone call, twitter message, etc. MUST always be put in a system of record first. Any email sources that MAY be incident-related need to be disabled, phased out, or redirected to the ticketing system one way or another. There should be literally NO incident-related email to handle, apart from some random, ad-hoc, one-off situations (f.ex. a CISO emailing another CISO, which cascades down to the SOC, but even these should find their place in a ticketing system)…

Let me reiterate: all the incident work should be done from a system of record/ticketing system, and analysts should have their mailboxes as clean as possible.

A system of record brings order and accountability to the work we do.

Email hygiene removes the clutter from analysts’ inboxes and reduces the number of daily context switches.

Mind you, we are still talking alert fatigue REDUCTION. This actually means constant MONITORING aka TELEMETRY. You need to build some metrics around the work that is done on a DAILY basis. And if you are a manager, you need to spend time reviewing that work. This will help improve SOPs, it will help improve the technical writing/reporting skills of the analysts, and it will highlight the ‘stupid’, the ‘unnecessary’, or the ‘borderline illegal’ things people actually do (and they do; f.ex. an external SOC member noticed our org was port-scanned from a distinctive IP, so he launched his own port scan against that IP; yes, facepalm).

What do you know about the origin of all alerts you see on your queue?

Many of them are ‘legacy’. Someone, somewhere, sometime in the past decided that a particular class of tickets is a MUST on the queue. Was it a result of a contractual, regulatory, or compliance/audit need? Was this detection added 10 years ago and is no longer important? Was it a whimsical, ego-driven, ‘wide’ detection added by ‘the most technical primadonna on the team’?

ROI is always the key.

The quality of alerting differs. Some alerts are simply high-fidelity; some are low-fidelity, but a true positive would have a tremendous impact on the org. Lots of parameters to think of. But… from a pragmatic angle… if you receive alerts on your queue that produced nothing but FPs over the last year, you do need to start asking yourself: why are we looking at them AS ALERTS at all?

Alert fatigue is not a problem of one person. It is a problem of the whole team. If you work in a SOC or on a triage team, keep your eyes open, look for patterns, and escalate ‘the stupid’.

And it may be reasonable to create two queues: one for urgent, time-sensitive, SLA-driven work & one for the ‘maybes’, the work that requires additional knowledge, some poking around, is ‘wider in scope’, and is NON-TIME-SENSITIVE.

There is no way to optimize your queue w/o your active participation. We live in a time when holacracy seems to be gaining popularity, but the reality is that we still need TECHNICAL decision makers who can make a lot of alerts go away. This activity may sometimes be perceived as brutal and ruthless, and it may hurt the feelings of some people (‘infosec rockstars’, ‘team primadonnas’), but ROI is out there to be your beacon and your excuse…

Cut it down, if you can. But fundamentally, start talking about it within your teams. Alert fatigue is a human-made problem. I guarantee you that you can cut down 50% of your alerts with a minimal penalty. How do I know that? I’ve done it before.


Inserting data into other processes’ address space, part 1a

I never thought I would write part 1a of my old post, but here it is.

As usual, I have not explored the topic below in depth, but I have certainly noticed the opportunities, and since this is how many interesting developments start, I guess it is still worth… talking about…

How do we copy data between processes?

In my old post I listed a number of inter-process data exchange ideas, but I missed the one that I believe is the most important (at least in 2022): the non-native stuff. And by that, I mean all these proprietary mechanisms of data exchange that have been developed over the years by vendors other than Microsoft. Many of them, of course, utilize the core components of the Windows OS and the very same inter-process communication and cross-process access API functions. Being ‘genuine’ software and all that, I bet this activity had to somehow pop up on the radar and then be filtered out, over time, by the likes of AV, EDR, and any other ‘watchmen’… cuz it’s genuine. It’s a stretch, of course, but to their credit, security solutions are getting better and better at detecting any sort of trickery…

With that in mind I started poking around DLLs of known vendors.

I soon discovered a DLL from NVidia (NvIFR.dll) that offers a particular set of exported functions:

  • NvIFR_ConnectToCrossProcessSharedSurfaceEXT
  • NvIFR_CopyFromCrossProcessSharedSurfaceEXT
  • NvIFR_CopyFromSharedSurfaceEXT
  • NvIFR_CopyToCrossProcessSharedSurfaceEXT
  • NvIFR_CopyToSharedSurfaceEXT
  • NvIFR_Create
  • NvIFR_CreateCrossProcessSharedSurfaceEXT
  • NvIFR_CreateEx
  • NvIFR_CreateSharedSurfaceEXT
  • NvIFR_DestroyCrossProcessSharedSurfaceEXT
  • NvIFR_DestroySharedSurfaceEXT
  • NvIFR_GetSDKVersion

Hmmm… Cross-Process, Shared, Connect To, and Copy… that certainly sounds interesting!

I don’t have access to any native Nvidia setup, and I don’t play games, so it’s hard to test what these functions really do :(. A quick google for NvIFR_ConnectToCrossProcessSharedSurfaceEXT brought up only this interesting reddit conversation.
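If you happen to have a box with NvIFR.dll on it, checking whether your build exposes these exports at all is trivial and does not require knowing their prototypes. Below is a minimal C sketch; only the DLL name and a subset of the export names from the list above are assumed, nothing else:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* a subset of the export names from the list above */
    const char *exports[] = {
        "NvIFR_ConnectToCrossProcessSharedSurfaceEXT",
        "NvIFR_CopyFromCrossProcessSharedSurfaceEXT",
        "NvIFR_CopyToCrossProcessSharedSurfaceEXT",
        "NvIFR_CreateCrossProcessSharedSurfaceEXT",
        "NvIFR_GetSDKVersion",
    };

    /* assumes NvIFR.dll is reachable via the standard DLL search order;
       otherwise pass a full path */
    HMODULE mod = LoadLibraryA("NvIFR.dll");
    if (!mod) {
        printf("NvIFR.dll not found (%lu)\n", GetLastError());
        return 1;
    }

    for (size_t i = 0; i < sizeof(exports) / sizeof(exports[0]); i++) {
        FARPROC p = GetProcAddress(mod, exports[i]);
        printf("%-45s %s\n", exports[i], p ? "present" : "missing");
    }

    FreeLibrary(mod);
    return 0;
}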

Poking around the available code, we can speculate that the cross-process shared surface interface is accessible via this pipe:

\\.\pipe\NVIFR_CPSS_%lld

After you write “[\x7F” to it, you can read back a buffer of 2136 bytes (in the DLL version I looked at). That buffer contains the name of the shared section we can then open, map, and… hopefully write to. The analysis of the code that follows is not straightforward: there are other DLLs being loaded, APIs resolved, and the complexities encountered would really benefit from ‘live’ analysis, but… c’est la vie.
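To make the above a bit more concrete, here is a minimal, purely speculative C sketch of what a client of that pipe might look like. It is untested (see above re: no Nvidia setup); the pipe name suffix, the handshake bytes, and the location/encoding of the section name inside the reply buffer are all assumptions based on the static observations, not confirmed behavior:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* the %lld suffix is some process/session-specific id; 0 is just a placeholder */
    const char *pipeName = "\\\\.\\pipe\\NVIFR_CPSS_0";

    HANDLE hPipe = CreateFileA(pipeName, GENERIC_READ | GENERIC_WRITE,
                               0, NULL, OPEN_EXISTING, 0, NULL);
    if (hPipe == INVALID_HANDLE_VALUE) {
        printf("pipe not available (%lu)\n", GetLastError());
        return 1;
    }

    /* the handshake bytes mentioned above ("[\x7F"); their meaning is unknown */
    const char handshake[2] = { '[', 0x7F };
    DWORD written = 0;
    WriteFile(hPipe, handshake, (DWORD)sizeof(handshake), &written, NULL);

    /* the reply buffer was 2136 bytes in the DLL version analyzed */
    char reply[2136] = { 0 };
    DWORD bytesRead = 0;
    ReadFile(hPipe, reply, (DWORD)sizeof(reply), &bytesRead, NULL);
    reply[sizeof(reply) - 1] = '\0';   /* crude guard for the string handling below */

    /* ASSUMPTION: the section name sits somewhere in the reply;
       offset 0 and ANSI encoding are placeholders for illustration only */
    const char *sectionName = reply;

    HANDLE hSection = OpenFileMappingA(FILE_MAP_READ | FILE_MAP_WRITE, FALSE, sectionName);
    if (hSection) {
        void *view = MapViewOfFile(hSection, FILE_MAP_READ | FILE_MAP_WRITE, 0, 0, 0);
        if (view) {
            /* the shared surface would be readable/writable here */
            UnmapViewOfFile(view);
        }
        CloseHandle(hSection);
    }

    CloseHandle(hPipe);
    return 0;
}

Even if the details above are off, the overall flow (a pipe handshake, a reply carrying a section name, then OpenFileMapping/MapViewOfFile) is the part worth keeping an eye on.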

And this is where this blurb ends. I know, I know, it’s not much, but imagine the possibilities. If we can find more similar pieces of code, legitimate, genuine software with cross-process data exchange snippets sitting inside signed DLLs or executables… we may well come up with many new ways of bypassing security solutions that might not have been possible in the past…