Looking for the randomness in the most non-AI/ML way…

Here’s an old-school file name-based research… it is not game changing, it won’t bring any immediate solution, but it’s still worth doing today…

The software we install (the focus here is on Windows, as usual) creates a loooot of files, and while many of them seem completely random and whimsical in nature, especially with regard to their file names, they do end up forming a corpus of sorts… Put differently, when bundled together, all these file names known to be created for legitimate purposes are great material for research.

For this post I collected 1.5M executable file names from Windows. They may not be a full set of file names ‘out there’, but it’s enough to play around with….

I then looked at the statistics of 2-, 3- and 4-character long infixes (ignoring any non [a-z] characters).

The results are below:

  • How often 2-character long infixes appear in these 1.5M file names: filename_stats_2.txt – as you can see, not very useful…
  • How often 3-character long infixes appear in these 1.5M file names: filename_stats_3.txt – not very useful either…
  • How often 4-character long infixes appear in these 1.5M file names: filename_stats_4.txt – this is better… we definitely can cherry-pick a lot of 4-character long infixes that never appear in the set: filename_stats_4_non-existing.txt
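For anyone who wants to reproduce this kind of counting, here is a minimal sketch. It rests on a few assumptions that are mine and not necessarily what was done for the stats files above: names are lowercased, every non [a-z] character is simply dropped (so the extension letters stay in), and an 'infix' is any n-character substring of what remains. The function names and the tiny sample list are hypothetical.

```python
import re
from collections import Counter
from itertools import product
from string import ascii_lowercase


def normalize(name: str) -> str:
    """Lowercase the name and drop everything that is not a-z (assumed reading of 'ignoring')."""
    return re.sub(r"[^a-z]", "", name.lower())


def infix_counts(names, n: int) -> Counter:
    """Count every n-character substring across all normalized names."""
    counts = Counter()
    for name in names:
        s = normalize(name)
        counts.update(s[i:i + n] for i in range(len(s) - n + 1))
    return counts


def missing_infixes(counts: Counter, n: int) -> list:
    """Return all a-z strings of length n that never occur in the counted names."""
    candidates = ("".join(p) for p in product(ascii_lowercase, repeat=n))
    return [c for c in candidates if c not in counts]


# Hypothetical three-name sample; the post used ~1.5M real file names.
sample = ["whoami.exe", "FirefoxSetup63.exe", "setup.exe"]
counts4 = infix_counts(sample, 4)
print(counts4.most_common(3))
print(len(missing_infixes(counts4, 4)), "4-character infixes never seen in the sample")
```

With 26^4 = 456,976 possible 4-character infixes, the 'never seen' list stays large even for a big corpus, which is what makes the 4-character case more interesting than the 2- and 3-character ones.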

Using the latter, we can create regex sets:

Using these regex sets you may actually get better at finding randomly named files! You will also find a lot of FPs, of course, but now you have a set of regexes you can tune to your needs…
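One possible way to turn a list of never-seen infixes into such a regex set is sketched below. This is only an illustration under my own assumptions: the infixes are plain a-z strings, they are split into chunks so no single alternation grows too large, and the file name is normalized (lowercased, non [a-z] characters dropped) before matching. The three infixes in the example are made up and do not come from filename_stats_4_non-existing.txt.

```python
import re


def build_regex_set(infixes, chunk_size: int = 1000) -> list:
    """Compile the infixes into several alternation regexes of at most chunk_size items each."""
    unique = sorted(set(infixes))
    return [
        re.compile("|".join(re.escape(x) for x in unique[i:i + chunk_size]))
        for i in range(0, len(unique), chunk_size)
    ]


def looks_random(name: str, regex_set) -> bool:
    """True if the normalized name contains at least one infix from the set."""
    s = re.sub(r"[^a-z]", "", name.lower())
    return any(rx.search(s) for rx in regex_set)


# Hypothetical infixes, chosen only for the demo.
regex_set = build_regex_set(["qxzj", "zzqk", "xkqv"], chunk_size=2)
for candidate in ["whoami.exe", "aqxzjb.exe"]:
    print(candidate, looks_random(candidate, regex_set))
```

Requiring two or more hits per name, instead of one, is an obvious knob to turn when the FP rate is too high.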

Can this be used in ML/AI research?

Yes, by all means, but the set of file names used as a base should be a loooot larger and collected in a more meaningful way. One can argue that f.ex. temporary files created by installers could be excluded. We could also exclude file names that follow certain patterns (f.ex. starting with a dollar ‘$’ or a tilde ‘~’, or conforming to the pattern ‘<GUID>.exe’). We could reduce the corpus by understanding versioned file names (f.ex. ‘FirefoxSetup63.exe’, ‘FirefoxSetup64.0.2.exe’, etc.). We could ignore non-English file names (‘Менеджер BIM Сервера GRAPHISOFT 19.exe’, ‘联系汉化作者.exe’, etc.) or artificially created file names that are used by many ‘download/update’ managers (‘ICReinstall_’ as in ‘ICReinstall_any_video_converter.exe’, ‘ICReinstall_driver identifier.exe’, etc.). Or… we could focus entirely on signed installers, or on those compiled within a certain timeframe (f.ex. the last decade).
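A few of these clean-up ideas are easy to sketch. The snippet below is a rough illustration only, with assumptions of my own: ‘non-English’ is approximated as ‘contains non-ASCII characters’, a GUID is taken to be the usual 8-4-4-4-12 hex layout with optional braces, and version numbers are collapsed into a ‘#’ placeholder. The helper names are hypothetical.

```python
import re

# Assumed GUID layout: 8-4-4-4-12 hex digits, optional braces, followed by .exe
GUID_RE = re.compile(r"^\{?[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}\}?\.exe$", re.IGNORECASE)
# A run of digits, optionally followed by dotted digit groups (63, 64.0.2, ...)
VERSION_RE = re.compile(r"\d+(\.\d+)*")


def keep(name: str) -> bool:
    """Decide whether a file name stays in the corpus under the assumed exclusion rules."""
    if name.startswith(("$", "~")):
        return False
    if GUID_RE.match(name):
        return False
    if not name.isascii():  # crude stand-in for 'non-English'
        return False
    if name.lower().startswith("icreinstall_"):
        return False
    return True


def collapse_version(name: str) -> str:
    """Replace version-like digit runs with '#', so versioned names fold into one entry."""
    return VERSION_RE.sub("#", name)


names = ["FirefoxSetup63.exe", "FirefoxSetup64.0.2.exe", "~setup.exe", "联系汉化作者.exe"]
print(sorted({collapse_version(n) for n in names if keep(n)}))
```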

As I said… it is not game changing, it won’t bring any immediate solution, but it’s still worth doing today…

And I will now answer the ‘why’:

– just to understand how hopeless the whole file name-matching idea is!

Who am I? Asking for my file friend: whoami.exe…

There has been a lot of talk about whoami.exe recently, so here’s one more post about it…

When we talk about whoami.exe we often think of it in ‘atomic’ terms. You run it, and you get the results. But by doing so we assume a lot i.e. we kinda indirectly know that we are talking about the executable located in this place:

  • c:\windows\system32\whoami.exe

Of course, some of us know that there is also a 32-bit version on the 64-bit OS:

  • c:\windows\SysWOW64\whoami.exe

and then a bunch of copies in the WinSxS directory (directory names are versioned):

  • c:\Windows\WinSxS\amd64_microsoft-windows-whoami_31bf3856ad364e35_10.0.19041.1_none_846d8bda2133af3c\whoami.exe
  • c:\Windows\WinSxS\wow64_microsoft-windows-whoami_31bf3856ad364e35_10.0.19041.1_none_8ec2362c55947137\whoami.exe
  • c:\Windows\WinSxS\amd64_microsoft-windows-whoami_31bf3856ad364e35_10.0.22621.1_none_30124a0a75945900\whoami.exe
  • c:\Windows\WinSxS\wow64_microsoft-windows-whoami_31bf3856ad364e35_10.0.22621.1_none_3a66f45ca9f51afb\whoami.exe

And of course, we can reveal the hard links for each of these tools using fsutil:

  • fsutil.exe hardlink list c:\windows\System32\whoami.exe
  • fsutil.exe hardlink list c:\windows\SysWOW64\whoami.exe

Plus, on Windows on Arm, we have:

  • c:\Windows\SysArm32\whoami.exe

and the respective WinSxS directories (directory names are versioned):

  • c:\Windows\WinSxS\arm64.arm_microsoft-windows-whoami_31bf3856ad364e35_10.0.22598.1_none_d3774312fcf7fb69\whoami.exe
  • c:\Windows\WinSxS\arm64.x86_microsoft-windows-whoami_31bf3856ad364e35_10.0.22598.1_none_d37c245afcf28323\whoami.exe
  • c:\Windows\WinSxS\arm64_microsoft-windows-whoami_31bf3856ad364e35_10.0.22598.1_none_2de72d3c78a075fb\whoami.exe
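The versioned WinSxS directory names listed in this post all appear to follow the same underscore-separated layout: architecture, component name, public key token, version, culture and a trailing hash. Assuming that layout holds (it is inferred from the examples here, not from a spec), a small hypothetical parser could split them like this:

```python
from typing import NamedTuple


class SxsName(NamedTuple):
    arch: str
    name: str
    token: str
    version: str
    culture: str
    digest: str


def parse_sxs_dir(dirname: str) -> SxsName:
    """Split a WinSxS directory name, assuming arch_name_token_version_culture_hash."""
    # Split from the right first, so underscores inside the component name do no harm.
    head, token, version, culture, digest = dirname.rsplit("_", 4)
    arch, name = head.split("_", 1)
    return SxsName(arch, name, token, version, culture, digest)


example = "arm64.x86_microsoft-windows-whoami_31bf3856ad364e35_10.0.22598.1_none_d37c245afcf28323"
print(parse_sxs_dir(example))
```

Grouping whoami.exe sightings by the parsed arch and version fields is one way to see at a glance which builds and architectures a given fleet actually carries.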

But there is more…

If you ever installed Cygwin, you probably know of:

  • c:\Cygwin\bin\whoami.exe
  • c:\Cygwin64\bin\whoami.exe

There is also Git for Windows, which installs a lot of Windows-friendly Unix tools including, yes, you guessed right, whoami.exe:

  • c:\Program Files\Git\usr\bin\whoami.exe

At this stage, you probably are aware that Program Files is a nightmare as it occurs in many architecture-specific forms, and many localized versions.

You must be thinking now – this thing is multiplying quickly and spreading faster than covid!

But this is not THE END. There really is more.

A Pro version of software called System Scheduler installs the following whoami.exe file:

  • c:\Program Files (x86)\SystemScheduler\WhoAmI.exe

It is probably the first whoami.exe I have ever seen that shows the user info in a GUI – as a message box 🙂

Then comes another contender, a tool called MacroCommanderPro:

  • c:\Program Files (x86)\MacroCommander\Bin\WhoAmI.exe

Yes, it is also a GUI-based whoami 🙂

And this is just the tip of the iceberg…

The reason I write about all this is that some people like to say ‘the moment someone runs whoami.exe on one of your systems, this is an indication of early stages of compromise!’. Their confidence is built on ignorance. And yes, they may be right… yeah… but they are often very wrong…
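To make the point a bit more concrete, here is a tiny sketch that buckets whoami.exe sightings by where the file lives, instead of treating the name alone as a signal. Everything in it is an assumption for illustration: the bucket labels are made up, the system drive is taken to be C:, and a real rule would need far more context than a path (parent process, user, signer, and so on).

```python
from pathlib import PureWindowsPath

# OS locations mentioned in this post, lowercased; assumes the system drive is C:
OS_DIRS = {
    r"c:\windows\system32",
    r"c:\windows\syswow64",
    r"c:\windows\sysarm32",
}


def classify_whoami(path: str) -> str:
    """Return a coarse, hypothetical bucket for a whoami.exe path."""
    p = PureWindowsPath(path.lower())
    if p.name != "whoami.exe":
        return "not whoami"
    parent = str(p.parent)
    if parent in OS_DIRS:
        return "os"
    if parent.startswith(r"c:\windows\winsxs" + "\\"):
        return "winsxs"
    return "other"  # Cygwin, Git for Windows, third-party GUI tools, or something worse


for sighting in [
    r"c:\windows\System32\whoami.exe",
    r"c:\Program Files\Git\usr\bin\whoami.exe",
    r"c:\Program Files (x86)\SystemScheduler\WhoAmI.exe",
]:
    print(classify_whoami(sighting), sighting)
```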

Telemetry we deal with today is rich and useful, but threat hunting – as a discipline – is still in its early, naive stages. It’s healthy to assume that for every rule written, for every assumption, there is an exception that can be found, and not only that: you will very often find it by combing telemetry generated by non-malicious sources…