Optimizing the regexes, or not

Every once in a while we all contemplate solving interesting yet kinda abstract threat hunting problems. This post describes one of these…

The problem:

Given a relatively large number of strings, how do you write a regular expression that matches all of them, but doesn’t match any other string?

The context:

I have extracted the file names of kernel drivers referenced by all the .inf files present inside all of the (unpacked) archives that can be found inside the DriverPack.

The rationale:

Hunting for new kernel drivers introduced to the environment may be easier if I can extract kernel driver names from the telemetry, and only report the creation of those that reference files NOT present on the ‘known list of good kernel driver file names’.

The solution:

Looking for existing tools that may help to address this problem in a generic way, I came across this Perl module – Regexp::Optimizer. To my surprise, it actually works quite nicely.
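
For reference, here is a minimal sketch of how such a regex could be generated. It is not necessarily how the regex in this post was produced. The three sample names are made up, and the sketch assumes the names are lowercased base names without the extension. It builds a naive alternation with quotemeta and hands it to Regexp::Optimizer's optimize method if the module (from CPAN, not core) is installed; otherwise it keeps the unoptimized alternation.

```perl
use strict;
use warnings;

# Hypothetical sample; the real input was ~7.5K driver base names.
my @names = qw(nvlddmkm nvhda64v netwtw04);

# Escape each name and join them into one naive alternation.
my $alternation = join '|', map { quotemeta($_) } sort @names;
my $re = qr/$alternation/;

# Optimize only if Regexp::Optimizer is available on this system.
if (eval { require Regexp::Optimizer; 1 }) {
    $re = Regexp::Optimizer->new->optimize($re);
}

# Anchor the pattern and append the extension, as in the test script.
my $full = qr/^(?:$re)\.sys$/;

# Lowercase the candidate instead of relying on /i, because an embedded
# qr// object carries its own flags and would ignore an outer /i.
for my $candidate (qw(nvlddmkm.sys NETWTW04.SYS evil.sys nvlddmkm.sys.bak)) {
    my $verdict = lc($candidate) =~ $full ? 'matched' : "didn't match";
    printf "%-18s %s\n", $candidate, $verdict;
}
```

With the real list, the names would be read from a file rather than hard-coded, and the resulting pattern written out for later use.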

I gave it 7.5K file names associated with ‘known clean kernel module drivers’ and it gave me the following regex. I have tested all the file names from the ‘ServiceBinary2su.txt’ file and the regex worked well. This is the test script:

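use strict;
use warnings;
use utf8;

$| = 1;

# Load the optimized regex from disk.
my $f = 'regex.txt';
open(my $fh, '<', $f) or die "Cannot open $f: $!";
binmode $fh;
read $fh, my $regex, -s $f;
close $fh;

# Drop a trailing line break so it does not become part of the pattern.
$regex =~ s/\R\z//;

# File name to test, passed as the first command-line argument.
my $x = shift;
die "Usage: $0 <file name>\n" unless defined $x;

# Group the pattern so the anchors apply to every alternative,
# and escape the dot so it only matches a literal '.'.
if ($x =~ /^(?:$regex)\.sys$/i)
{
    print("$x matched\n");
}
else
{
    print("$x didn't match\n");
}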

The final regex is 52624 bytes long. The input data was 103317 bytes long (including new lines). So the regex is about 51% of the size of the input, a roughly 2:1 ‘compression rate’, but debugging such a complicated regex pattern sounds like a heck of a job. It would seem that sometimes solving interesting yet kinda abstract threat hunting problems brings more confusion to the process than we anticipate… And getting fixated on using regexes to solve this kind of problem is actually a bigger problem in itself. Trie structures built for multi-pattern search are far more suitable for this sort of multi-pattern search and comparison.
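
To illustrate the trie idea, here is a minimal sketch using nested Perl hashes. The function names build_trie and in_trie and the sample file names are made up for this example, and matching is assumed to be case-insensitive on the full file name.

```perl
use strict;
use warnings;

# Build a character trie from a list of names.
# The empty-string key marks the end of a stored name.
sub build_trie {
    my @names = @_;
    my %root;
    for my $name (@names) {
        my $node = \%root;
        # Walk or create one level per character.
        $node = $node->{$_} //= {} for split //, lc $name;
        $node->{''} = 1;
    }
    return \%root;
}

# Return 1 only if the whole name is stored in the trie, 0 otherwise.
sub in_trie {
    my ($trie, $name) = @_;
    my $node = $trie;
    for my $char (split //, lc $name) {
        $node = $node->{$char} or return 0;
    }
    return exists $node->{''} ? 1 : 0;
}

# Hypothetical known-good list and hypothetical telemetry values.
my $known = build_trie(qw(nvlddmkm.sys nvhda64v.sys netwtw04.sys));
for my $seen (qw(NVLDDMKM.SYS nvlddmk.sys evil.sys)) {
    print "$seen is not on the known-good list\n" unless in_trie($known, $seen);
}
```

If only exact, whole-name lookups are needed, a plain hash keyed on the lowercased file name would do the same job with less code. The trie becomes more interesting once prefix queries are needed as well.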

Hexacorn
