A short wishlist for signature/rules writers

January 21, 2019 in Preaching

Oh, no… yet another rant.

Not really.

Today I will try to discuss a phenomenon that I observe in the signature/rules writing space – one that used to be predominantly occupied by antivirus companies. Today it is daily bread for malware analysts, sample hoarders, threat intel folks, and the threat hunting crowd as well. Plus, lots of EDR and forensics/IR solutions use these rules, and they come in really handy in memory analysis, as well as in retrohunting and sorting sample collections.

The phenomenon I want to talk about relies on these two factors:

  • writing signatures / rules is very easy
  • there is no punishment for writing bad ones

This phenomenon is, as you probably guessed by now, the uneven quality of signatures / rules. And I am very careful with words here. Some of these rules are excellent, and some could be better.

Before I proceed further, let me ask you a question:

What is the most important part of a good yara signature?

  • Is it a unique string, or a few of them?
  • A clever Boolean condition?
  • A filter that cherry-picks scanned files, e.g. one that, for Windows executables, looks for the ‘MZ’ marker first, then the ‘PE’ header?

These are all equally important, and used in most of the yara rules you will find today. I find it interesting, though, that most rules don’t include the ‘filesize’ condition. And I wonder why? This filter helps to exclude tons of legitimate files, as well as malicious files that fall outside the file size range used by the family the specific yara rule covers. If applied properly, it can potentially skip expensive searches across the whole file.
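To make this concrete, here is a minimal sketch of the kind of rule I have in mind. The rule name, the strings, and the 2MB cut-off are all hypothetical stand-ins – in a real rule they would come from the family you are actually covering:

```yara
rule Hypothetical_Family_Sample
{
    strings:
        // hypothetical markers – replace with real family indicators
        $s1 = "unique_family_marker" ascii
        $s2 = { 6A 40 68 00 30 00 00 }

    condition:
        // structural filters: 'MZ' at offset 0, 'PE\0\0' at e_lfanew
        uint16(0) == 0x5A4D and
        uint32(uint32(0x3C)) == 0x00004550 and
        // size range the family is known to use (made-up limit)
        filesize < 2MB and
        all of them
}
```

The point is the shape of the condition: the structural checks and the ‘filesize’ constraint narrow the candidate set before the string matches are even considered.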

Update: I stand corrected, it would seem the ‘filesize’ is checked _after_ the strings are checked (thx Silas and Wesley). This is a poor performance optimization choice, in my view. Still, see what I wrote below about this exact scenario – it doesn’t matter if yara performs poorly on this condition today; they may improve it tomorrow. Additionally, the way we use yara rules and how they are compiled matters! In a curated ruleset the issues I am referring to don’t make much difference. It does make a difference with individual scans, e.g. on a file system. In my experience, many of the rules I get from public sources can’t be combined/compiled into one single bulky rule because of conflicts, so I tend to run yara.exe many times, each time using different yara rule files. See the Twitter convo for some interesting back and forth between us. Thanks guys!

I think the practical reason why many analysts forget about this condition is pretty basic. It’s very rare for any of us to write rules that must be optimized and checked for quality. While signatures in the AV industry go through a lot of testing before they are released, our creations are often deployed as soon as they are written, tested on a bunch of sample files only, and very rarely on a larger sample set that includes lots of files: large ones, clean ones, and tricky ones (intended to break parsers & rules, e.g. corrupted files).

Update: Important to mention that, based on our Twitter convo, it again depends very much on the circumstances. It is possible that your environment or your needs do not require checking this.

Our rules are typically for local consumption, so performance or accuracy are not necessarily a priority. But performance is important. And even more so – a different mindset.

We write rules to detect, i.e. to include matching patterns, but not to exclude non-matching ones. And the latter is important – the faster we can determine that a file doesn’t match our rule, the faster the yara engine can finish testing that file and move on to the next.

And even if the yara engine were the worst search engine ever, actually reading the whole file, and the ‘filesize’ condition were not really helping performance, it would still make sense to write rules in ‘the best effort’ way. There is always a new version of the engine, the authors take feedback in, and one day a future version may optimize code and comparisons for exactly this condition.

Coincidentally, this is actually one of the principles that most antivirus engineers learn very early in their career: learn to exclude non-matching stuff, and learn to do it early in your detection rule.
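In yara terms, that principle translates into ordering the condition so that cheap structural checks come first and the expensive string or regex work comes last. A sketch – the regex and the size limit are made up, and this ordering helps only to the extent that the engine short-circuits condition evaluation, which recent yara versions do:

```yara
rule Exclude_Early_Sketch
{
    strings:
        // an intentionally expensive pattern (hypothetical)
        $url = /https?:\/\/[a-z0-9\.\/\-]{16,}/ ascii

    condition:
        // cheap checks first: magic bytes, then size, then the costly match
        uint16(0) == 0x5A4D and
        filesize < 500KB and
        $url
}
```

A non-PE file, or anything over the size limit, gets rejected by the first two comparisons without the engine ever needing to care about the regex result.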

The sole intention of this post is to highlight the importance of thinking of signatures / rules not only as a way to quickly detect stuff, but in a wider context – as a way to ignore the non-matching stuff. The earlier in the rule, the better.

Update: After the Twitter convo I now know I chose a wrong example to illustrate my point. I should have used, for example, the common ‘good’ strings that we can sometimes find in public yara rules (these strings appear in both malware and good files because they are part of a library). The hits that these strings generate on ‘good’ files can be avoided by testing rules on larger corpora of samples, including ‘good’ files. There are plenty of other examples.
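A sketch of that pitfall – both strings here are hypothetical, with the first standing in for the kind of library banner that ships inside plenty of benign, statically linked binaries:

```yara
rule Library_String_Pitfall
{
    strings:
        // too generic: a library banner (hypothetical zlib-style string)
        // that also appears in many clean, statically linked binaries
        $lib = "inflate 1.2.11 Copyright" ascii
        // pairing it with a family-specific marker (also hypothetical)
        // is what actually cuts the false positives
        $family = "c2_beacon_cfg" ascii

    condition:
        uint16(0) == 0x5A4D and $lib and $family
}
```

A rule that matched on $lib alone would light up on half a goodware corpus, which is exactly the kind of problem that only shows up when you test against clean files too.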

