High Fidelity detections are Low Fidelity detections, until proven otherwise

A few days ago Nas kicked off an interesting discussion on Xitter about the quality of detections. I liked it, so I offered my personal insight. I then added a stupid example to illustrate my point, to which DylanInfosec replied:

Would love to set some time aside and gather some OS log dumps, throw em in a SIEM and test that way or something. I guess crowd validation with a trusted diverse group could work too. Not-for-profit or anything but just to share with the community

This made me think…

I am an old-school data hoarder; for as long as I can remember I have been actively looking for data of interest in a lot of places… And I must confess that the only reason I could immediately provide that stupid mimi-based regex file name search example was that I had access to my private ‘clean’ file names dataset…

You see… over a decade ago I kicked off a personal project that focused on collecting software data from CLEAN sources. While many people in the cybersecurity industry at that time focused primarily on malware collections, I decided to take a step forward and collect data that was most likely clean. So I wrote a number of web scrapers and downloaders, used VPN and Tor where necessary, and eventually built a large set of (most likely) clean files downloaded from publicly available sources. I didn’t stop there. I took every single sample I downloaded and decompiled it whenever possible… then processed all the decompiled files to build a modern, full-blown, Windows-centric clean software data collection that I believed at the time to be far better than NIST’s.

Now, it’s been a few years and this set is getting older and older, every single day, so perhaps it’s time for it to win some brownie points in the community…

Many of our threat hunting rules depend on file names. The file I am attaching to this post includes a list of many PE file names in my collection that are known to be ‘clean’ (to be precise, these are all file names ending with the following file extensions: ‘exe’, ‘dll’, ‘drv’, ‘ocx’, ‘sys’). It goes without saying that you must treat this list as very suspicious, but I hope it will help you to write better detections…

_files_of_interest.su.zip

And to illustrate the point, let’s run a query that is similar to the one I did for my tweet:

rg -i "mimi.*?\.(dll|exe|sys)" _files_of_interest.su
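For readers without ripgrep at hand, the same check can be sketched in a few lines of Python. This is only a rough equivalent of the query above, not the exact way the list was searched; the function name `matching_names` and the sample file names are made up for illustration and are not taken from the dataset.

```python
import re

# Same pattern as the rg query; re.IGNORECASE mirrors rg's -i flag.
PATTERN = re.compile(r"mimi.*?\.(dll|exe|sys)", re.IGNORECASE)


def matching_names(names):
    """Return the file names that the mimi-based pattern would flag."""
    return [name.strip() for name in names if PATTERN.search(name)]


# Hypothetical file names, invented for this example only.
sample = ["mimikatz.exe", "MimicViewer.exe", "notepad.exe", "pantomimist.dll"]
hits = matching_names(sample)
print(hits)
```

Because the function accepts any iterable of strings, an open file handle for the extracted list can be passed in directly. Every hit that comes from a known-clean name is a false positive the rule would have produced in the wild.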

Note: you can’t use the _files_of_interest.su.zip/_files_of_interest.su files for commercial purposes.

Adding character(s) to Command Line processing

In my old post about certutil I mentioned that it accepts a number of lesser-known Unicode characters passed on its command line. It is also a very well-known fact that PowerShell accepts a number of Unicode characters representing “-” and its variations.

What’s new? You may ask…

Processing the command line was never easy. All operating systems, their various shells, and many command line tools come with their own command line parsing ideas and quirks, but, I bet, whoever designed many of these command line argument parsers didn’t really see the Unicode character set coming…

In recent years we moved away from a simple world of "-", "--", and "/" as command/option switches towards a world that is, well… kinda developing now.

In 2024 we have a number of popular Windows programs that accept a lot of Unicode characters as ‘special’ (either as a part of a command line, or ‘pasted’ to the program):

  • \t (Unicode 0x0009) – <Character Tabulation> (HT, TAB) // \t needs to be interpreted
  • \n (Unicode 0x000A) – <Line Feed> (EOL, LF, NL) // \n needs to be interpreted
  • \r (Unicode 0x000D) – <Carriage Return> (CR) // \r needs to be interpreted
  • " " (Unicode 0x0020) – Space (SP) // ignore quotes
  • " (Unicode 0x0022) – Quotation Mark
  • ' (Unicode 0x0027) – Apostrophe
  • - (Unicode 0x002D) – Hyphen-Minus
  • / (Unicode 0x002F) – Solidus, slash, forward slash
  • – (Unicode 0x0096 – mapped to 0xFB in codepage 437)
  • " " (Unicode 0x00A0) – No-Break Space (NBSP) // ignore quotes
  • – (Unicode 0x2013) – En Dash
  • — (Unicode 0x2014) – Em Dash
  • “ (Unicode 0x201C) – Left Double Quotation Mark
  • ” (Unicode 0x201D) – Right Double Quotation Mark
  • " " (Unicode 0x202F) – Narrow No-Break Space (NNBSP) // ignore quotes
  • − (Unicode 0x2212) – Minus Sign
  • and possibly more
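Since several of the characters above look identical on screen, it can help to print them by code point. The snippet below is a small sketch using Python's standard `unicodedata` module; the `describe` helper is a made-up name. The last two lines reflect my reading of the 0x0096 entry, namely that the same byte 0x96 decodes to an En Dash under Windows-1252 and to U+00FB under codepage 437; treat that interpretation as an assumption.

```python
import unicodedata

# Code points taken from the list above.
CODE_POINTS = [
    0x0009, 0x000A, 0x000D, 0x0020, 0x0022, 0x0027, 0x002D, 0x002F,
    0x0096, 0x00A0, 0x2013, 0x2014, 0x201C, 0x201D, 0x202F, 0x2212,
]


def describe(cp):
    """Format one code point with its Unicode name and general category."""
    ch = chr(cp)
    # Control characters have no formal name, so a placeholder is supplied.
    name = unicodedata.name(ch, "<control>")
    return f"U+{cp:04X} {name} (category {unicodedata.category(ch)})"


for cp in CODE_POINTS:
    print(describe(cp))

# The same raw byte decodes differently depending on the code page.
print(repr(b"\x96".decode("cp1252")))  # En Dash, U+2013
print(repr(b"\x96".decode("cp437")))   # U+00FB
```

The category column makes the grouping visible: the dash lookalikes are mostly `Pd`, the Minus Sign is `Sm`, and the space lookalikes are `Zs`.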

While not all programs accept these yet, we can already list a few that actually do:

  • certutil.exe
  • powershell.exe
  • pwsh.exe
  • certreq.exe
  • conhost.exe

You may ask… what’s the big deal?

Well, the big deal is that many assumptions about how command line arguments are passed to programs have shaped a whole industry that is obsessively focused on detection engineering built around “recognizable command line patterns”.

These Unicode characters break a lot of these assumptions…
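To make the problem concrete, here is a minimal sketch of a naive pattern-based rule and one possible mitigation: folding the lookalike characters to their ASCII forms before matching. Everything in it is an assumption made for illustration: the rule, the `normalize_cmdline` helper, the sample command line, and the choice of which characters fold to which ASCII character. Which program accepts which character, and in which position, has to be tested per tool.

```python
import re

# Lookalikes from the list above, grouped by the ASCII character they imitate.
# The grouping is an assumption made for this sketch, not a complete mapping.
DASH_LIKE = "\u0096\u2013\u2014\u2212"
SPACE_LIKE = "\t\n\r\u00a0\u202f"
QUOTE_LIKE = "\u201c\u201d"

_TRANSLATION = str.maketrans({
    **{c: "-" for c in DASH_LIKE},
    **{c: " " for c in SPACE_LIKE},
    **{c: '"' for c in QUOTE_LIKE},
})


def normalize_cmdline(cmdline):
    """Fold dash, space, and quote lookalikes to their ASCII counterparts."""
    return cmdline.translate(_TRANSLATION)


# A hypothetical rule of the "recognizable command line pattern" kind.
NAIVE_RULE = re.compile(r"certutil(\.exe)?\s+-urlcache", re.IGNORECASE)

# Hypothetical command line that uses En Dash (U+2013) instead of Hyphen-Minus.
evasive = "certutil.exe \u2013urlcache \u2013f http://example.invalid/a.txt a.txt"

raw_hit = bool(NAIVE_RULE.search(evasive))
normalized_hit = bool(NAIVE_RULE.search(normalize_cmdline(evasive)))
print(raw_hit, normalized_hit)
```

Normalizing before matching keeps the rule itself readable, at the cost of hiding which variant was actually used; keeping the raw command line alongside the normalized one preserves that signal.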