Good Exports are real

Collecting ‘good’ samples helps to discover a lot of interesting patterns. In my old post I focused on the PDB paths extracted from the DriverPack driver collection, yesterday I touched on the list of ‘file names associated with good known kernel drivers’, and today I will cover the function names exported by a very large corpus of ‘good’ DLL samples.

You may ask what the value is here, and my answer is that ‘this is what normal looks like’.

How is that useful to the Threat Hunting crowd?

If you monitor rundll32 invocations referencing DLLs and their API functions you may quickly discover a lot of anomalies. Any invocation referring to a non-OS DLL is suspicious. Any invocation referring to a DLL in a suspicious location is… suspicious. Any process using unusual constructs is suspicious. Any process invoking DLL exported functions via ordinal numbers is suspicious. Any process referencing API ordinals via negative or large ordinal numbers is super-suspicious, too.
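A few of these checks can be sketched in Python. This is only an illustration under stated assumptions: it handles the common `rundll32 <dll>,<entry>` form only, the `SYSTEM_DIRS` list and the `LARGE_ORDINAL` threshold are placeholder values picked for this example, and the function names are hypothetical.

```python
import ntpath
import re
from typing import List, Optional, Tuple

# Assumed values for this sketch; tune them for your own environment.
SYSTEM_DIRS = (r"c:\windows\system32", r"c:\windows\syswow64")
LARGE_ORDINAL = 0xFFFF  # anything above this is flagged as "large"

def split_rundll32(cmdline: str) -> Optional[Tuple[str, str]]:
    """Return (dll_path, entry_point) from a rundll32 command line, or None."""
    m = re.search(
        r'rundll32(?:\.exe)?"?\s+(?P<dll>"[^"]+"|[^,\s]+)\s*,\s*(?P<entry>\S+)',
        cmdline,
        re.IGNORECASE,
    )
    if not m:
        return None
    return m.group("dll").strip('"'), m.group("entry")

def ordinal_flags(entry: str) -> List[str]:
    """Flag entry points given as #<number>, including odd ordinal values."""
    flags: List[str] = []
    m = re.fullmatch(r"#([+-]?\d+)", entry)
    if m:
        number = int(m.group(1))
        flags.append("ordinal")
        if number < 0:
            flags.append("negative-ordinal")
        elif number > LARGE_ORDINAL:
            flags.append("large-ordinal")
    return flags

def review(cmdline: str) -> List[str]:
    """Collect the 'suspicious' flags raised by a single command line."""
    parsed = split_rundll32(cmdline)
    if parsed is None:
        return ["unparsed"]
    dll, entry = parsed
    flags = ordinal_flags(entry)
    # ntpath handles Windows-style paths regardless of the analysis host OS.
    directory = ntpath.normcase(ntpath.dirname(dll))
    if directory and directory not in SYSTEM_DIRS:
        flags.append("non-system-location")
    return flags

if __name__ == "__main__":
    for line in (
        r"rundll32.exe C:\Users\Public\x.dll,#-5",
        "rundll32.exe shell32.dll,Control_RunDLL",
    ):
        print(line, "->", review(line))
```

A bare DLL name with no directory is not flagged by the location check here, because resolving it would require knowing the search path of the process.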

These are great ‘suspicious’ tests, but we can do more.

The ‘StartW’ export used by Cobalt Strike DLLs is a good example. Invocations of this function are not necessarily ‘suspicious’ by default, because we don’t have a point of reference. There are so many legitimate invocations of rundll32 executing exported functions from so many DLLs that it’s hard to zoom in on this particular function and declare that it’s bad. Again, we need a point of reference of sorts.

The list of functions exported by ‘good’ DLLs is far longer than expected: 11375507 unique entries, with many very popular and some only occurring once. You can download an archived text file referencing many ‘good export names’ from here.

There are so many uses for this set:

  • known-good names for threat hunting purposes
  • a very fertile ground for deeper lolbin research
  • a very fertile ground for discovering new vulnerabilities
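The first use, known-good names for threat hunting, can be sketched as a simple set lookup. This is a minimal sketch, assuming the downloaded list is a text file with one export name per line (the actual layout of the archive may differ). The function names and the sample data below are made up for illustration.

```python
from typing import Iterable, List, Set

def load_known_good(lines: Iterable[str]) -> Set[str]:
    """Build the reference set, assuming one export name per line."""
    return {line.strip() for line in lines if line.strip()}

def unseen_exports(observed: Iterable[str], known_good: Set[str]) -> List[str]:
    """Return observed export names that are absent from the reference set.

    The comparison is case-sensitive, because export names are resolved
    case-sensitively by GetProcAddress.
    """
    return sorted({name for name in observed if name not in known_good})

if __name__ == "__main__":
    # Tiny stand-in for the real list; with the real file you would do:
    #   with open(path_to_list, encoding="utf-8", errors="replace") as fh:
    #       known = load_known_good(fh)
    known = load_known_good(["DllRegisterServer\n", "Control_RunDLL\n", "\n"])
    seen_in_telemetry = ["Control_RunDLL", "NotARealExport", "control_rundll"]
    print(unseen_exports(seen_in_telemetry, known))
```

A name missing from the set is not a verdict on its own. It is just a signal that the export has no point of reference in the ‘good’ corpus and deserves a closer look.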

The set is watermarked, so you have been warned. You cannot use this set for any commercial purpose. You cannot create any commercial detection based on this data. The only exceptions are fully unlimited use by law enforcement, and use for educational and non-commercial research purposes.

Optimizing the regexes, or not

Every once in a while we all contemplate solving interesting yet kinda abstract threat hunting problems. This post describes one of these…

The problem:

Given a relatively long list of strings, how do you write a regular expression that matches them all, but doesn’t match any other string?

The context:

I have extracted file names associated with kernel drivers referenced by all the .inf files present inside all of the (unpacked) archives that can be found inside the DriverPack.

The rationale:

Hunting for new kernel drivers introduced to the environment may be easier if I can extract kernel driver names from the telemetry and report only the newly created ones that reference files NOT present on the ‘known list of good kernel driver file names’.

The solution:

Looking for existing tools that may help to address this problem in a generic way, I came across this Perl module: Regexp::Optimizer. To my surprise, it actually works quite nicely.
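To give a feel for how a list of strings can be folded into a single pattern, here is a simplified Python illustration that merges shared prefixes through a trie. It is not the implementation used by Regexp::Optimizer, only a sketch of the general prefix-merging idea. The function names and the sample driver names are chosen for this example.

```python
import re
from typing import Dict, Iterable

END = ""  # marker key: one of the input strings ends at this node

def build_trie(words: Iterable[str]) -> Dict:
    """Insert every word into a nested-dict trie, one character per level."""
    root: Dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root

def trie_to_regex(node: Dict) -> str:
    """Emit a pattern that matches exactly the strings stored below node."""
    optional = END in node  # a word may stop here, so the rest is optional
    branches = [
        re.escape(ch) + trie_to_regex(child)
        for ch, child in sorted(node.items())
        if ch != END
    ]
    if not branches:
        return ""
    if len(branches) == 1 and not optional:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    return body + "?" if optional else body

if __name__ == "__main__":
    names = ["usbhub", "usbstor", "usbser"]  # sample names without extension
    pattern = trie_to_regex(build_trie(names))
    print(pattern)
    print(all(re.fullmatch(pattern, n) for n in names))
```

The pattern has to be anchored (or used with `re.fullmatch`) to guarantee it does not hit on longer strings that merely contain one of the names.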

I gave it 7.5K file names associated with ‘known clean kernel module drivers’ and it gave me the following regex. I have tested all the file names from the ‘ServiceBinary2su.txt’ file and the regex worked well. This is the test script:

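use strict;
use warnings;
use utf8;

$| = 1;

# Read the generated regex from disk
my $f = 'regex.txt';
open(my $fh, '<', $f) or die "Cannot open $f: $!";
binmode $fh;
read($fh, my $regex, -s $f);
close $fh;
$regex =~ s/[\r\n]+\z//;    # drop a trailing newline, if any

# The file name to test is passed as the first argument
my $x = shift;
die "Usage: $0 <file name>\n" unless defined $x;

# Group the pattern and escape the dot so '.sys' is matched literally
if ($x =~ /^(?:$regex)\.sys$/i)
{
    print("$x matched\n");
}
else
{
    print("$x didn't match\n");
}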

The final regex is 52624 bytes long. The input data was 103317 bytes long (including new lines). The regex is therefore about 51% of the size of the input, which is a decent ‘compression rate’, but debugging such a complicated regex pattern sounds like a heck of a job. It would seem that sometimes solving interesting yet kinda abstract threat hunting problems brings more confusion to the process than we anticipate… And getting fixated on using regexes to solve this kind of problem is actually a bigger problem in itself. Trie structures designed for multi-pattern search are far better suited to this sort of multi-pattern search and comparison.
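For comparison, a trie-based lookup for the same ‘known good kernel driver file names’ check could look like the sketch below. It is a minimal illustration, not a tuned implementation: the class and function names are hypothetical, the sample file names stand in for the real list, and names are lowercased because Windows file names are compared case-insensitively.

```python
from typing import Dict, Iterable, List

_END = None  # marker key: a complete file name ends at this node

class NameTrie:
    """A minimal character trie for exact, case-insensitive name lookups."""

    def __init__(self, names: Iterable[str] = ()) -> None:
        self._root: Dict = {}
        for name in names:
            self.add(name)

    def add(self, name: str) -> None:
        node = self._root
        for ch in name.lower():
            node = node.setdefault(ch, {})
        node[_END] = True

    def __contains__(self, name: str) -> bool:
        node = self._root
        for ch in name.lower():
            node = node.get(ch)
            if node is None:
                return False
        return _END in node

def unknown_drivers(observed: Iterable[str], known: NameTrie) -> List[str]:
    """Return observed driver file names that are not on the known list."""
    return [name for name in observed if name not in known]

if __name__ == "__main__":
    known = NameTrie(["usbhub.sys", "usbstor.sys"])  # stand-in for the real list
    print(unknown_drivers(["USBHUB.SYS", "evil.sys"], known))
```

For pure exact matching a plain Python set would do the same job. The trie is shown because it is the structure named above, and because it extends naturally to prefix queries. Either way, each entry stays readable and debuggable, unlike a 52624-byte regex.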