Yara Carpet Bomber

A lot of people are sharing their Yara creations (look for the #100DaysofYARA tag on Twitter), so I thought I would share a bit too.

This is a very unusual way of using Yara and I hope you will find it interesting.

When we think of Yara rules we usually have a very specific cluster of strings in mind – be it an API name, a debug string, a snippet of code, etc. What if instead we used Yara to scan files for much larger sets of strings? While it may sound counterintuitive, Yara is really very well prepared to do “carpet bombing” string scans on target files. It’s actually super fast and efficient.

Let’s have a look at an example.

Imagine that you want to find all English words inside a file. I chose “English” because it’s easy to demo, but you could use any other language really. The traditional approach would rely on running the “strings” tool over the target file and then manually combing through the results, cherry-picking words that “look” English. For other languages you may need a localized version of the “strings” tool (e.g. my old tool hstrings could help), but the principle is the same. In some cases you could also apply knowledge of the file structure so that you could extract some of the strings ‘natively’ (e.g. from resources in a PE file).

We can also approach it from a different angle. We will build a list of all English words and then search for them in the file. All at once. There are obvious caveats – we can never be sure we have a list of all English words (e.g. gobbledygook or ragamuffin may not be on the list), and short words will certainly cause a lot of false positives – but it’s just a POC of an idea.

So, we find a random English word list. We write a small script that keeps all strings of 6 or more characters, excludes strings starting with digits, and converts the rest into a set of Yara rules. Yara accepts up to 10K strings per rule, so we have to split the dictionary into multiple rules.

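 use strict;
 use warnings;

 my $cnt = 0;   # index of the current rule
 my $n   = 0;   # strings emitted into the current rule
 while (<>)
 {
     s/[\r\n]+//g;              # strip line endings
     next if length($_) < 6;    # keep 6+ character words only
     next if /^[0-9]/;          # skip entries starting with a digit
     s/\\/\\\\/g;               # escape backslashes for Yara
     s/"/\\"/g;                 # escape double quotes for Yara
     if ($n == 0)
     {
         # open a new rule: eng_0000, eng_0001, ...
         printf "\nrule eng_%04d\n{\n  strings:\n", $cnt;
     }
     print "    \$ = \"$_\" ascii wide nocase\n";
     $n++;
     if ($n > 9999)
     {
         # 10K strings reached: close this rule
         print "\n  condition:\n    any of them\n}\n";
         $cnt++;
         $n = 0;
     }
 }
 # close the last rule only if it is still open
 print "\n  condition:\n    any of them\n}\n" if $n > 0;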

The resulting rules can be saved into an eng.yar file and then compiled with yarac to eng.yac:

yarac eng.yar eng.yac

We will get a lot of warnings about the rule slowing down the scanning, but who cares 🙂

 warning: rule "eng_0000" in eng.yar(10008): rule is slowing down scanning
 warning: rule "eng_0001" in eng.yar(20016): rule is slowing down scanning
 warning: rule "eng_0002" in eng.yar(30024): rule is slowing down scanning
 warning: rule "eng_0003" in eng.yar(40032): rule is slowing down scanning
 warning: rule "eng_0004" in eng.yar(50040): rule is slowing down scanning
 ...

Note, the resulting file is gigantic – ~600MB in size. You can reduce it by tinkering with the “ascii wide nocase” modifiers (if you drop them, the file will be only ~70MB).
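If you want to play with the modifiers, it helps to make them a parameter of the generator. Below is a minimal Python sketch of the same idea as the Perl script above. The function name build_rules and its parameters are made up for this illustration, and I have not measured how each modifier combination affects the compiled size.

```python
def build_rules(words, modifiers="ascii wide nocase", chunk=10000, prefix="eng"):
    """Return Yara source text with at most `chunk` anonymous strings per rule."""
    # Same filtering as the Perl script: 6+ characters, no leading digit.
    kept = [w.strip() for w in words]
    kept = [w for w in kept if len(w) >= 6 and w[0] not in "0123456789"]
    rules = []
    for i in range(0, len(kept), chunk):
        lines = [f"rule {prefix}_{i // chunk:04d}", "{", "  strings:"]
        for w in kept[i:i + chunk]:
            # Escape backslashes first, then double quotes.
            esc = w.replace("\\", "\\\\").replace('"', '\\"')
            # rstrip() drops the trailing space when no modifiers are given.
            lines.append(f'    $ = "{esc}" {modifiers}'.rstrip())
        lines += ["  condition:", "    any of them", "}"]
        rules.append("\n".join(lines))
    return "\n\n".join(rules) + "\n"


if __name__ == "__main__":
    sample = ["accelerator", "cat", "4wheel", "application", "architecture"]
    # No modifiers at all: Yara then defaults to plain, case-sensitive ASCII.
    print(build_rules(sample, modifiers="", chunk=2))
```

Generating two or three variants this way and compiling each with yarac is a quick way to see which modifiers cost the most.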

We can now use the rules on e.g. Notepad:

yara -s -C eng.yac c:\windows\notepad.exe

-s – will extract strings
-C – will tell yara the rules are compiled

The results will look like this:

eng_0000 c:\windows\notepad.exe
 0x280b5:$: Accelerator
 0x2822a:$: Accelerator
 0x2822a:$: Accelerators
 0x26f00:$: Accept
 0x2b9bd:$: Access
 0x2862e:$: Acquire
 0x286f6:$: Acquire
 0x2862e:$: AcquireS
 0x286f6:$: AcquireS
 0x28df9:$: Activation
 0x27faf:$: Active
 0x28e04:$: actory
 0x286d9:$: Address
 0x289c5:$: alLock
 0x28b68:$: alLock
 eng_0001 c:\windows\notepad.exe
 0x25050:$: A\x00p\x00p\x00l\x00i\x00c\x00a\x00t\x00i\x00o\x00n\x00
 0x25260:$: A\x00p\x00p\x00l\x00i\x00c\x00a\x00t\x00i\x00o\x00n\x00
 0x2ba0f:$: application
 0x2baf4:$: application
 eng_0002 c:\windows\notepad.exe
 0x2b75e:$: Archit
 0x2b88f:$: Archit
 0x2b75e:$: Architect
 0x2b88f:$: Architect
 0x2b75e:$: Architecture
 0x2b88f:$: Architecture
 0x227ba:$: A\x00r\x00o\x00u\x00n\x00d\x00
 0x2b6c8:$: assembl
 0x2b713:$: assembl
 0x2b7e5:$: Assembl
 0x2b7f9:$: assembl
 [...]
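As the output shows, the same offset is often reported several times (Archit, Architect, Architecture), and wide hits are printed with \x00 between the characters. Here is a small Python sketch that tidies this up by keeping only the longest hit per offset. It assumes the text layout shown above; the function name longest_hits is hypothetical.

```python
import re

# Matches lines such as " 0x2b75e:$: Architecture"
LINE = re.compile(r"^\s*(0x[0-9a-fA-F]+):(\$\w*):\s?(.*)$")


def longest_hits(yara_output):
    """Map offset -> longest matched string, across all rules in the output."""
    best = {}
    for line in yara_output.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # rule header lines such as "eng_0000 <path>"
        offset = int(m.group(1), 16)
        # Wide matches are printed with a literal \x00 after each character.
        text = m.group(3).replace("\\x00", "")
        if len(text) > len(best.get(offset, "")):
            best[offset] = text
    return best


if __name__ == "__main__":
    sample = r"""eng_0002 c:\windows\notepad.exe
 0x2b75e:$: Archit
 0x2b75e:$: Architect
 0x2b75e:$: Architecture
 0x227ba:$: A\x00r\x00o\x00u\x00n\x00d\x00
"""
    for offset, text in sorted(longest_hits(sample).items()):
        print(f"{offset:#x} {text}")
```

Keeping the longest hit is just one possible policy. It hides shorter dictionary words that happen to be prefixes of longer ones, which is usually what you want when skimming the results.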

Playing CAPAeira with Yara rules

Writing Yara rules is easy. Writing good Yara rules is … testing – both as an adjective and a verb.

There is a class of Yara rules – the ones that rely on actual machine code – that we can now do better.

How?

Your typical approach to writing code-based Yara sigs is relying on byte streams of machine code extracted from analyzed programs – usually a very specific code sequence of interest (e.g. RC4 algo, Luhn check routine, etc.). We then ‘patch’ offsets in jumps, calls, etc. to account for their variability.
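The ‘patching’ step boils down to replacing the variable bytes with ?? wildcards in a Yara hex string. A minimal Python sketch of that step follows. The helper to_yara_hex and the byte sequence are made up for illustration, and you still have to decide by hand which byte ranges are variable.

```python
def to_yara_hex(code, wildcard_ranges):
    """Render bytes as a Yara hex string, masking the given (start, length) ranges."""
    masked = set()
    for start, length in wildcard_ranges:
        masked.update(range(start, start + length))
    tokens = ["??" if i in masked else f"{b:02X}" for i, b in enumerate(code)]
    return "{ " + " ".join(tokens) + " }"


if __name__ == "__main__":
    # Made-up x86 bytes: call rel32; test eax, eax; jz rel8
    code = bytes.fromhex("E8 10 20 30 40 85 C0 74 05")
    # Mask the 4-byte call displacement and the 1-byte jump displacement.
    print(to_yara_hex(code, [(1, 4), (8, 1)]))
    # prints: { E8 ?? ?? ?? ?? 85 C0 74 ?? }
```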

Such Yara rules are common and pretty handy. They work most of the time, but there is a caveat. Compiler behavior and malicious coders’ tricks may shift machine code around and, as a result, some code sequences may differ. As such, a pretty decent Yara rule based on a very specific piece of program code may fail on newer samples.

In order to improve the effectiveness of code-based Yara signatures we can now use capa.

You may be laughing now – capa itself is a detection engine. Given a bunch of samples, we could just run our capa rules over them and get detections we need. The problem is the speed. The second problem is that while Yara is supported by nearly everything that blinkenlights, Capa is not.

The best approach is therefore to analyze the code, write a good capa signature, and then use it to test your Yara rules. Your Yara rule must detect the very same sample set that capa hits on. This is an iterative process, but it allows you to cherry-pick variants and subtle differences in implementation that can then lead you to improve your Yara sigs. Moreover, if you have other ways to detect samples as belonging to a certain malware family, you can then correlate them against your family-specific capa and Yara rulesets and highlight missing Yara rules. Using the capa output you could auto-generate Yara rules as well (although this is a bit silly without manual oversight; if blindly automated, it would literally be like hashing).
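The bookkeeping behind this loop is plain set arithmetic. Here is a minimal Python sketch, assuming you have already collected which samples each engine hits on. The function name compare_detections and the sample names are hypothetical.

```python
def compare_detections(yara_hits, capa_hits):
    """Compare two collections of sample identifiers (e.g. hashes or paths)."""
    yara_hits, capa_hits = set(yara_hits), set(capa_hits)
    return {
        "both": sorted(yara_hits & capa_hits),
        # capa sees the capability but the Yara rule does not fire:
        # these are the samples worth a closer look.
        "missed_by_yara": sorted(capa_hits - yara_hits),
        # Yara fires where capa does not: possible false positives.
        "yara_only": sorted(yara_hits - capa_hits),
    }


if __name__ == "__main__":
    report = compare_detections(
        yara_hits=["sample_a", "sample_b", "sample_d"],
        capa_hits=["sample_a", "sample_b", "sample_c"],
    )
    for bucket, samples in report.items():
        print(bucket, samples)
```

Each round of rule tweaking should shrink the missed_by_yara bucket without growing yara_only.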

The task of correlating the capa and yara detections/rules can be delegated to existing Python libraries – something along these lines:

import yara
import capa.main
import capa.rules
from capa.features import ARCH_X32, ARCH_X64, String
from capa.features.insn import Number, Offset
...
yr = yara.compile(filepath='foo.yar')
fm = yr.match(filename)
if fm:
   ... fm[0] ...

# capa's Python API has changed between releases; adjust these calls to your version
cr = capa.main.get_rules('foo.yml',
    disable_progress=True)
rs = capa.rules.RuleSet(cr)
ex = capa.main.get_extractor(filename, "auto",
    disable_progress=True)
ca, ccn = capa.main.find_capabilities(rs, ex,
    disable_progress=True)
try:
   ... ca.keys() ...

<print output, match, whatever>
...