Writing better Yara rules in 2023…


This post has been featured in Josh Hammond’s video! Yay, thanks Josh!

Also, you should know that there are many other great resources online focusing on Yara performance improvements, efficiency, etc., f.ex.:

Old Post

In my previous post I mused about an impossible task – how to consolidate a large, unorganized yara ruleset (the kind many of us, admittedly, collect and hoard by downloading rules randomly from all corners of the internet w/o much thinking…) into a single, monolithic, ready-to-use-for-a-quick-triage yara rule set…

Trust me: it’s not easy, but still doable, and in my previous post I kinda outlined a process that extracts, isolates and classifies each rule one by one, so that once they are all in separate files, we strip their metadata, deduplicate them, check their syntax for errors, find hidden duplicates (via compilation), and finally merge the survivors into one gigantic rule set that can be used directly with yara.exe (or compiled with yarac.exe for faster processing).

Yup, it’s not perfect, but this is one of these “it kinda worked for me” situations…

Since my last post I have made lots of improvements to that idea and I can now painlessly merge any set of random yara rules with minimal manual input – with the caveat that I brutally ditch many of the ‘bad rules’ in the process… It’s their authors’ fault though 😉 Yup, I am serious – if your rule is kinda over-engineered, no one wants it, and this is why I also kill it at source cuz I don’t want it either 😉

Anyway… this post is not about automation. It’s about quality. A subject close to my heart, because this is my second post on yara rule writing ‘best practices’ – the first one was a disaster, because I was wrong on many fronts, and was corrected by the actual Yara devs… It was an eye-opening experience, for sure. That old post – I took it down.

And I do hope this post will be a bit better!

Reviewing many Yara rules written by others is a great research adventure… You learn a lot not only about various optimization tricks, but also about Yara syntax you were not even aware existed… And yes, nothing works better for learning than a corpus of good examples… You can RTFM all day long and then someone breaks the rules (sic!) and uses the language syntax in a way no one ever thought of before… And yes, looking at all these rules you find mistakes too 🙂 Not only bad strings, but also wrong assumptions, bad copypasta, and sometimes a certain level of laziness…

So, let’s cover a few things that I learned from eyeballing these 40K+ yara rules:

  • ditch PEiD-based yara rules – they are truly obsolete in 2023 and many of them are simply naive; also, 32-bit packers and protectors prevalent in the 1990s and 2000s are not interesting anymore
  • ditch file properties and feature extraction-oriented rules – better tools exist today f.ex. DiE, capa; detecting that a file is a PE, PE.NET, ELF, MSI or DOC/X, or XLS/X via yara is simply NOT the best use of yara and IT IS AN ALREADY SOLVED PROBLEM on a modular level (see f.ex. dotnet.is_dotnet discussion below)
  • separate specific (client-specific, scope-specific, threat actor-specific, etc) and generic rules – the generic ones may be useful for scoped DFIR/loose triage/sweeps, or targeted VM assessment engagements, but you don’t want these FP-prone rules to be used on every ‘light’ scan across the fleet of many endpoints, and for many customers; there simply is a really high cost associated with reviewing results of many of these wishy-washy ‘detections’, and I guarantee you – you will find a lot of results… nowadays many DF/IR companies rely on yara-based scanners for their triage activities and trust me, reviewing the results that these tools produce is a very time-consuming task — notably, most of the hits are False Positives
  • separate public rules you want to share and private rules you want to keep to yourself
  • separate in-memory and file-specific yara rules; memory-only rules are great, but when run against a file system, even on a clean installation of Windows 10 or 11, they may produce FPs; I have seen that happen; and then you have the TEMP and CACHE directories of your browsers; they will store snapshots of pages, including websites/blogs/social media pages hosted by security vendors, yourself, etc., and if there are no file-specific conditions in your yara rule, you will get hits (FPs!); last, but not least – if your yara rule hits on itself, it is probably not a good yara rule at all, so add this self-check to your test routine!
  • ditch rules dependent on other rules – it’s really hard to process these in bulk; to merge these with a larger yara rules corpora w/o breaking something is impossible; yara rule dependencies are nothing but a fancy-pantsy show-off of one’s skillset, and – while they will surely find some local appeal – the moment you share them widely… such monstrosities will surely break many yara rules aggregating builds… that is – simple is BETTER, use KISS principle and avoid nesting conditions
  • keep rules high-fidelity, as much as you can…
  • but then again, for DFIR, Threat Hunting/Triage, and often incident-, or even client-, or scope-specific rules, keep them very wide-open and allow low-fidelity, if needed – in such cases it is actually acceptable and welcome, because the scope and objective is different… I must confess I literally wrote a very controversial yara rule for a very specific APT case — it was then published online and all hell broke loose when people started b*g about its quality; I still stand by it though, it was a beautiful example of a targeted yara rule that was detecting ‘surgical’ patches to a code introducing a skeleton key functionality to a logon request processing binary – it was horrendous on a grand scheme of things though, so it would produce a lot of FPs if you wanted to use it in retroscans on VT, but on systems of interest — it was detecting patches beautifully, and with 100% accuracy
  • if you can, keep a list of sample hashes you used for writing the rule inside the rule meta section; your internal QC process can use them for verification, and… you can also use an external QC process relying on yara-ci that can extract these for automated QC…; it takes care of both False Positives (via NSRL set) and False Negatives (via hashes you pass in the meta data)
  • having said that, remember that I exposed the NSRL set as being err.. a bit limited and quite outdated!
  • when you write rules, add as many conditions as possible, including a distinction between a non-MZ and MZ file, non-PE and PE file, an EXE and a DLL, 32- and 64-bit, .NET and non-.NET executables, driver and non-driver – yes, this may be overkill sometimes, but we do want that precision, long-term; think of it – just checking ‘MZ’ at the top of the file is NOT enough; we don’t want to include DOS or Windows 3.1/DOS4GW MZ/NE/LE executables in any sampleset ‘caught by yara rules’ written in 2023…; plus, more file types and subtypes appear out there on a regular basis, so we need to keep some focus – as a result, we do want to keep libraries of yara rules that target different architectures (Intel, ARM), OSes (Windows, Linux, macOS), different programming languages (non-.NET vs. .NET, C, Delphi, VB, Go, Rust, Nim, etc.), wrappers (autoit, pyinstaller, pyarmor, ps2exe, etc.), and webshells (php, asp, jsp, etc.) separate and easily recognizable (that granularity may also help to surgically deploy a subset of yara rules against specific file types)
  • keep rules readable – these cryptic oneliner conditions, no comments at all — it’s just a poor hygiene, so:
    • add as many comments to explain logic as you can
    • explain the values of hexadecimal strings & the reason they were chosen (include printable versions of these!)
    • comment on code snippets to explain uniqueness of these code sequences…
  • keep yourself up to date with yara documentation – it changes often, and this project keeps on delivering lots of new features all the time – often, these ‘novelties’ simplify writing a lot of detections that used to be tough to write in the past (and are now simplified using file-type-specific modules, loops, ranges, console module, etc.); for example, ‘pe.pdb_path icontains’ may be better today than a PDB string found inside the binary _anywhere_
  • stop using ‘stupid’ strings for detections; they may look good at the time of targeted rule-writing, but they are not so good in the grand scheme of things; c’mon… rules that use strings like these should be literally banned!:
    • KERNEL32
    • NTDLL
    • USER32
    • WS2_32
    • ADVAPI32
    • NSS3
    • rundll32.exe
    • .text
    • <input type="
    • GlobalSign
    • Thawte
    • Response.Write
    • BSJB // this is .NET PE file ‘magic’
    • #Strings
    • #GUID
    • C#
    • cmd.exe
    • powershell
    • POST
    • Content-Type: application/x-www-form-urlencoded
    • VirtualProtect
    • DecryptFileA
    • CreateThread
    • WriteProcessMemory
    • echo
    • path
    • AssemblyTitle
    • cookies.sqlite
    • <?php
    • JScript
    • <?xml version
    • request
    • java.lang.
    • public
    • server
    • CreateObject
    • [InternetShortcut]
    • and many more…

All of these strings will hit a gazillion clean files and are POOR string detection choices. Yes, we need some of them to determine a file type, and yes, we need some of them to confirm the file is written in a scripting language, but be aware that many of these strings exist inside looooots of clean files! This means that using them is equivalent to lots of wasted CPU cycles!
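To make the point concrete, here is a minimal, hypothetical sketch (the strings and the `$code` byte pattern are made up for illustration): weak strings can still earn their keep, but only when demoted to a confirming role behind a strict file-type gate and something genuinely unique:

```yara
// Illustrative only – $code stands in for a family-specific byte pattern.
rule weak_strings_fenced_in : example
{
    strings:
        $s1 = "VirtualProtect" ascii        // weak on its own
        $s2 = "WriteProcessMemory" ascii    // weak on its own
        $s3 = "cmd.exe" ascii wide          // weak on its own
        $code = { 8B 45 ?? 33 C8 C1 C1 0D } // the actual anchor
    condition:
        uint16(0) == 0x5A4D and                 // 'MZ'
        uint32(uint32(0x3C)) == 0x00004550 and  // 'PE\0\0'
        filesize < 2MB and
        $code and 2 of ($s*)
}
```

The weak strings no longer drive the detection – they only corroborate the unique code anchor, and the MZ/PE/filesize gate keeps the rule from firing on the scripts, caches and documents where those strings are ubiquitous.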

  • Improve your wildcard/regex-fu:
    • stop using regex wildcards haphazardly… in the worst-case scenario, your “.*” or “.+” may end up slowing the scanning process substantially if it is applied to a large file (f.ex. a 300MB MZ/PE); use .{0,30}, .{4,16} instead… that is, smaller, more practical and controlled ranges – let that regex engine breathe, exit gracefully, and… relatively early
  • do not rely on ‘pe.imports("mscoree.dll")’ to determine if an MZ executable is a .NET executable; many modern .NET executables DO NOT import ‘mscoree.dll’! f.ex. see 323d7b7d1eb3501e183c7d58bcb46c0e12f56e432765cfcb6302b8e4fe49842d
  • Better rely on the following construct, which ensures the .NET (COM descriptor) directory RVA – the NETTblRVA – is not empty, for both 32- and 64-bit MZ/PE executables:
        uint16(0) == 0x5A4D and                        // 'MZ'
        (uint32(uint32(0x3C)) == 0x00004550) and       // 'PE\0\0' at e_lfanew
        // data directory 14 (COM/.NET descriptor): its RVA sits at e_lfanew+232
        // for PE32 (magic 0x010B) and at e_lfanew+248 for PE32+ (magic 0x020B)
        ((uint16(uint32(0x3C)+24) == 0x010B and uint32(uint32(0x3C)+232) > 0) or (uint16(uint32(0x3C)+24) == 0x020B and uint32(uint32(0x3C)+248) > 0)) and
  • or, in newer versions of yara/dotnet, you can simply use
	  dotnet.is_dotnet == 1
  • try this on your sample zoo – you will get surprised:
import "pe"
import "dotnet"
import "console"

rule isdotnet_no_mscoree : windows
{
	condition:
		dotnet.is_dotnet == 1 and
		not pe.imports("mscoree.dll")
}
  • build an awareness of how popular NULL character is; avoid using it as an anchor/atom that ‘prefixes’ your ‘unique’ strings
  • learn machine opcodes for the most common CPU instructions (and if you are into S/M: MMX, SSE, AVX)
    • you don’t need to rely on ‘strings’ only – learn at least basic code opcodes & recognize variable parts of machine instructions and their sequences — this will help you to choose better, more unique code blocks as solid detections f.ex. functions responsible for calculating API hashes, RC4 or Luhn formula calculating routines, config decryption routines specific to a malware family, threat author, shared code, etc. – surprisingly, they often stay the same across many compilations/builds (on a binary level)
  • don’t use these, please…
    • pe.imports("kernel32.dll", "CreateFileA")
    • pe.imports("user32.dll", "FindWindowA")
    • pe.imports("kernel32.dll", "GetModuleHandleA")
    • pe.imports("kernel32.dll", "GetModuleFileNameA")
    • pe.imports("kernel32.dll", "IsDebuggerPresent")
    • pe.imports("kernel32.dll", "TerminateProcess")
    • pe.imports("kernel32.dll", "GetTickCount")
  • Look at clean files, on a regular basis:
    • Rust, Go, Nim executable files – they are “different” — they include lots of garbage, lots of nothingness (from a yara perspective, at least); if you are not familiar with their clean sampleset, you may choose wrong strings that may hit on many clean files!
    • Signed executables are not always GOOD; many threat actors use stolen certificates to sign their malicious productions
    • Javascript files that look GOOD may not be GOOD — threat actors are clever and poison legitimate code with their injects; ‘at first glance’ the javascript may look ‘ok’ but it’s not
    • Newer native Windows executables destroyed the meaning of the PE compilation timestamp (with reproducible builds, the field now holds a deterministic hash rather than an actual build time)! So, now we have exceptions for native OS executables and older Delphi files. Still, it may be a good yara condition to include only those files with a compilation stamp within a certain timeframe
    • Use new compiler and linker-specific artifacts to your advantage – many new file properties that TAs have not caught up with yet; these may be good inclusions or exclusions!
    • Even your best code-based signature cannot escape the scrutiny of gazillion of clean samples compiled with the same binary string present in it; for example, some very popular public yara repositories use ‘unique’ code strings to detect a certain code condition in a binary; yes, it works on a cluster of that particular malware family samples, but this condition and its binary representation is also shared by a lot of clean files; as a result, FPs all over the place — so, when you decide to do code-based sigs make sure you are hitting the code of a Threat Actor and not a run-of-the-mill legit code sample that is being copied, pasted and compiled all over the internet!
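Going back to the wildcard/regex point above, a quick sketch of what a bounded range looks like in practice (the pattern itself is made up for illustration):

```yara
rule bounded_regex_example : example
{
    strings:
        // Bad: ".+" has no upper bound – on a 300MB file the engine may
        // crawl through enormous spans before giving up:
        //   $bad = /https?:\/\/.+\/gate\.php/
        // Better: a small, controlled range that exits early:
        $good = /https?:\/\/.{4,64}\/gate\.php/
    condition:
        $good
}
```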


  • make sure you peer review your rules, and then do retrohunts or do content:<hex string> searches on VT to confirm your hunches…
  • if you have a sampleset of clean files, run your rules against them

Newer yara builds support the ‘console’ module, allowing you to print out the values extracted from, and conditions applied to, the files. Yara’s -D argument allows you to extract the values as ‘seen’ by yara modules, f.ex. ‘elf’, ‘pe’, ‘dotnet’. Use them, use them across many samples, compare the results, build some stats around these results, then find unique similarities within clusters of samples and codify them. Then again, test against VT, test against clean file corpora, then (if no FP detected) push to production.
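A sketch of what such a triage helper might look like (the field choices here are just an example):

```yara
import "pe"
import "dotnet"
import "console"

// Prints the values yara 'sees' for every scanned PE, so you can diff
// them across a sampleset before codifying a real rule.
rule dump_values_for_triage : triage
{
    condition:
        uint16(0) == 0x5A4D and
        console.log("is_dotnet: ", dotnet.is_dotnet) and
        console.hex("entry_point: ", pe.entry_point) and
        console.log("sections: ", pe.number_of_sections)
}
```

Each console.* call returns true, so chaining them with ‘and’ both prints the values and keeps the condition satisfied for every PE file.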

The very last point I want to cover is aesthetics. Many people who hunt for new malware fail at the most crucial point: communicating their detection methodology. And in the case of Yara rules, let’s be honest – writing these is not easy. BUT. Yara rule analysis is… even harder. Especially if the writer omits details. And you name it… if they are not beautified, if their strings are not readable, if their conditions are not explained, if they look esoteric, cryptic, or otherwise ‘3l33t’, they are not useful at all! Let’s face it… the best yara rules are the ones that tell some sort of a story and/or help all the other yara writers build better yara writing habits…

And in the interest of full disclosure, I hope I stick to at least 50% of what I preach here… 🙂