Yara Carpet Bomber

A lot of people are sharing their Yara creations (look for the #100DaysofYARA tag on Twitter), so I thought I would share a bit too.

This is a very unusual way of using Yara and I hope you will find it interesting.

When we think of Yara rules, we usually have a very specific cluster of strings in mind: an API name, a debug string, a snippet of code, etc. What if we instead used Yara to scan files for much larger sets of strings? While it may sound counterintuitive, Yara is really very well prepared to do “carpet bombing” string scans on target files. It’s actually super fast and efficient.

Let’s have a look at an example.

Imagine that you want to find all English words inside a file. I chose “English” because it’s easy to demo, but you could really use any other language. The traditional approach would rely on running the “strings” tool over the target file and then manually combing through the results, cherry-picking words that “look” English. For other languages you may need a localized version of the “strings” tool (e.g. my old tool hstrings could help), but the principle is the same. In some cases you could also apply knowledge of the file structure so that you could extract some of the strings ‘natively’ (e.g. from resources in a PE file).
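For comparison, the core of what a “strings”-style tool does can be sketched in a few lines of Python. This is only a minimal sketch, not the implementation of any particular tool: the function name extract_strings and the default minimum length of 6 are assumptions made for illustration, and it only handles printable ASCII and the same characters stored as UTF-16LE (“wide”).

```python
import re


def extract_strings(data: bytes, min_len: int = 6):
    """Yield (offset, text) for printable ASCII and UTF-16LE runs in data.

    Hypothetical helper: it mimics the basic idea of a "strings" tool only.
    """
    # Runs of printable ASCII bytes (space through tilde).
    ascii_run = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # The same characters stored as UTF-16LE: each one followed by a NUL byte.
    wide_run = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)

    for m in ascii_run.finditer(data):
        yield m.start(), m.group().decode("ascii")
    for m in wide_run.finditer(data):
        yield m.start(), m.group().decode("utf-16-le")
```

The output of such a tool still has to be reviewed by hand, which is exactly the step the dictionary approach tries to automate.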

We can also approach it from a different angle. We will build a list of all English words and then search for them in the file. All at once. There are obvious caveats: we can never be sure we have a list of all English words (e.g. gobbledygook or ragamuffin may not be on the list), and short words will certainly cause a lot of false positives, but it’s just a POC of an idea.

So, we find a random English word list. We write a small script that extracts all strings of 6 or more characters, excludes strings starting with digits, and converts the rest into a set of Yara rules. Yara accepts up to 10K strings per rule, so we have to split the dictionary into multiple rules.

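 use strict;
 use warnings;

 my $cnt=0;   # index of the rule currently being written
 my $n=0;     # number of strings already written into that rule
 while (<>)
 {
     s/[\r\n]+$//;              # strip the line ending
     next if length($_)<6;      # keep only words of 6+ characters
     next if /^[0-9]/;          # skip entries starting with a digit
     if ($n==0)                 # open a new rule
     {
         print "rule ".sprintf("eng_%04d", $cnt++)."\n{\n  strings:\n";
     }
     print "    \$ = \"$_\" ascii wide nocase\n";
     if (++$n>9999)             # 10K strings written: close the rule
     {
         print "  condition:\n    any of them\n}\n\n";
         $n=0;
     }
 }
 # close the last, partially filled rule
 print "  condition:\n    any of them\n}\n" if $n>0;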
The resulting rules can be saved into an eng.yar file and then compiled with yarac to eng.yac:

yarac eng.yar eng.yac

We will get a lot of warnings about the rule slowing down the scanning, but who cares 🙂

 warning: rule "eng_0000" in eng.yar(10008): rule is slowing down scanning
 warning: rule "eng_0001" in eng.yar(20016): rule is slowing down scanning
 warning: rule "eng_0002" in eng.yar(30024): rule is slowing down scanning
 warning: rule "eng_0003" in eng.yar(40032): rule is slowing down scanning
 warning: rule "eng_0004" in eng.yar(50040): rule is slowing down scanning

Note that the resulting file is gigantic: ~600MB in size. You can reduce it by tinkering with the “ascii wide nocase” modifiers (if you remove them, the file will be only ~70MB).

We can now use the rules on e.g. Notepad:

yara -s -C eng.yac c:\windows\notepad.exe

-s – will extract strings
-C – will tell yara the rules are compiled

The results will look like this:

eng_0000 c:\windows\notepad.exe
 0x280b5:$: Accelerator
 0x2822a:$: Accelerator
 0x2822a:$: Accelerators
 0x26f00:$: Accept
 0x2b9bd:$: Access
 0x2862e:$: Acquire
 0x286f6:$: Acquire
 0x2862e:$: AcquireS
 0x286f6:$: AcquireS
 0x28df9:$: Activation
 0x27faf:$: Active
 0x28e04:$: actory
 0x286d9:$: Address
 0x289c5:$: alLock
 0x28b68:$: alLock
 eng_0001 c:\windows\notepad.exe
 0x25050:$: A\x00p\x00p\x00l\x00i\x00c\x00a\x00t\x00i\x00o\x00n\x00
 0x25260:$: A\x00p\x00p\x00l\x00i\x00c\x00a\x00t\x00i\x00o\x00n\x00
 0x2ba0f:$: application
 0x2baf4:$: application
 eng_0002 c:\windows\notepad.exe
 0x2b75e:$: Archit
 0x2b88f:$: Archit
 0x2b75e:$: Architect
 0x2b88f:$: Architect
 0x2b75e:$: Architecture
 0x2b88f:$: Architecture
 0x227ba:$: A\x00r\x00o\x00u\x00n\x00d\x00
 0x2b6c8:$: assembl
 0x2b713:$: assembl
 0x2b7e5:$: Assembl
 0x2b7f9:$: assembl
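The raw output is noisy: the same offset is reported several times (Archit, Architect and Architecture all match at 0x2b75e), and wide matches are printed with escaped NUL bytes. One way to tidy it up is sketched below. It assumes the `yara -s` output was saved as text in exactly the format shown above, and the names HIT and longest_hits are hypothetical, chosen only for this example.

```python
import re

# A hit line from `yara -s`, in the form shown above: "0x2b75e:$: Architect"
HIT = re.compile(r"^\s*0x([0-9a-fA-F]+):\$[^:]*:\s(.*)$")


def longest_hits(yara_output: str) -> dict:
    """Map each file offset to the longest string reported at that offset."""
    best = {}
    for line in yara_output.splitlines():
        m = HIT.match(line)
        if not m:
            continue  # rule header lines (rule name followed by the file path)
        offset = int(m.group(1), 16)
        # Wide matches are printed with a literal "\x00" after each character;
        # dropping those escapes recovers the readable word.
        text = m.group(2).replace("\\x00", "")
        if len(text) > len(best.get(offset, "")):
            best[offset] = text
    return best
```

This only collapses hits that start at the same offset. Fragments that begin in the middle of a longer word at a different offset (like “actory” above) would still need separate handling.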