Excelling at Excel, Part 3

One of the most common use cases we come across during our malware analysis exercises is an ROI-driven comparison of features between many samples of the same malware family. Yes, we can use BinDiff or Diaphora (and we should), but if the research is time-sensitive, we need to take some shortcuts to deliver early results pronto.

Here’s one way to do it.

Note: I have used this approach many times in the past, because it’s simple, easy to understand, and produces a visual that is VERY nice to include in your deliverables. Your customers do not want to see lengthy reports including all the gory metadata and strings extracted from each and every sample. They want to see THE STORY.

A high-level matrix showing the differences between versions of malware can often be built quickly by looking at and comparing the strings extracted from multiple samples (I am simplifying a bit here: yes, in many cases you may need to decrypt these strings first, but that is a reversing task we are all used to, so I won’t be covering it here).

So, how do we go about it?

After the basic triage, we extract strings selectively from all the malware samples we have and put them all in Excel. Like this:

We now have a simple database of ‘features’ or unique capabilities of each sample. There are similarities between them that are immediately striking; plus, some features seem to exist across many samples, and some are just one-offs. As a side note: all these one-offs are VERY interesting as they are often examples of bad OPSEC, and/or may reveal some of the additional intentions/motivations of the attackers. In some rare cases, older compilers embed so-called ‘dead code’ (never executed) in the final executable, so strings extracted from such a sample provide rare intel that may help with attribution. True story: over a decade ago I analyzed one such sample and the dead code gave away a lot of info about the attacker – enough to pinpoint the exact individual responsible for that particular hacking spree.

We now want to build a superset list of all features and, for each sample, put ‘yes’ if the feature is present and ‘no’ if it is not.

So, we first add a new column to which we will copy and paste all the filled-in cells from each sample’s column, one by one:

We then remove all duplicates from the ‘All’ column by using the Data / Remove Duplicates function:

giving us this as a result:

Let me now show you what we want to build from this data: the final product we are after is a matrix of all interesting strings / features (column ‘All’) cross-referenced with each sample’s strings:

How do we build it?

We use three functions: VLOOKUP, ISNA and IF.

VLOOKUP helps us find a given string in a specific column. If it exists, it will simply give us its value; otherwise it will return the #N/A error (not available). We then use the ISNA function to test the result of VLOOKUP. If it is #N/A, we output “No”, otherwise we output “Yes”. We then add some basic cell formatting on top of it (borders to make it look like a table, text centered vertically and horizontally within the cells), plus some conditional formatting (if a cell contains “Yes”, make it blue, and if “No”, make it red), and we get the result shown above.

These are the formulas I used in this example:

It may look complicated at first, but the logic is brutally simple:

We start with the VLOOKUP function: we look for a string from the fixed column $E ‘All’ (the dollar sign in front of it prevents the column from changing to F, G, H during the horizontal formula copy via CTRL+R mentioned below) within the A:A column (which is sample1), and if it is found (ISNA returns FALSE) we say ‘Yes’, otherwise we say ‘No’. We then populate this formula with CTRL+D (vertically) and CTRL+R (horizontally), and all the other cells should now be filled in with formulas like the ones in the picture above.
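
For reference, a formula matching this description, placed in the first data cell of the sample1 column (the exact cell references depend on your layout, so treat them as illustrative), would look something like this:

=IF(ISNA(VLOOKUP($E2,A:A,1,FALSE)),"No","Yes")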

When we build a matrix like this we can immediately spot some interesting bits about the samples:

  • sample4 may be the earliest, as it includes `written by bored Bob`, a possible OPSEC fail and attribution bit here
  • sample1 and sample3 seem to be the most advanced, with sample3 probably being the latest as it offers remoteshell capabilities; the ‘screencapture’ feature present in sample1 is not present in sample3, which could be explained as ‘it’s 2023 and video streaming is a thing aka screenshots are so 2010’

Of course, there are a million other ways to do the very same task. And after all, doing this via Excel is a manual and kinda mundane task, but again, there are some lessons learned here:

  • by experimenting like this we build processes that can then be automated with better tools (f.ex. python, perl); see the sketch right after this list
  • data presented in tables speaks to customers better than the most comprehensive reverse engineering efforts (none of them want to look at IDA or Ghidra screenshots; they want root-cause analysis, TTPs, IOCs, a high-level description of features and capabilities, and what has changed between malware versions found on their systems)
  • in many cases I encountered, especially in the Linux world (ELF files), this is more than enough to pinpoint the main differences between samples of the same malware family; it’s a great time saver!
  • even more interesting is another bit: for many trojanized programs or libraries (again, especially in the Linux world), string-based comparisons against their clean versions often yield really great results (somehow, threat actors love to add a lot of extra debugging strings, messages, etc. that immediately stand out)
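
To illustrate the automation point above, here is a rough Python sketch of the same matrix-building step. It is only a sketch: it assumes one text file per sample (sample1.txt, sample2.txt, …) containing the interesting strings, one per line, and all file and variable names are made up for the example.

# Build the same Yes/No feature matrix outside of Excel (sketch only).
# Assumes per-sample string lists in sample1.txt, sample2.txt, ... in the current folder.
from pathlib import Path
import pandas as pd   # to_excel() additionally needs openpyxl installed

samples = {p.stem: set(p.read_text().splitlines()) for p in sorted(Path('.').glob('sample*.txt'))}
all_features = sorted(set().union(*samples.values()))    # the deduplicated 'All' column

matrix = pd.DataFrame(
    {name: ['Yes' if feature in strings else 'No' for feature in all_features]
     for name, strings in samples.items()},
    index=all_features,
)
matrix.to_excel('feature_matrix.xlsx')                   # ready for final touches in Excel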

Yara rules pageant

A few days ago I posted a very specific question on Twitter and Mastodon:

You’ve got a gazillion random yara rules stored inside many random .yar files scattered around many folders. What do you use to read them all, remove duplicates, ensure all rule names are unique, and make sure all the unique rules end up in a ‘merged’ final .yar file (or files)? I am aware of these projects & gists:
https://github.com/plyara/plyara
https://github.com/lsoumille/Yara_Merger
https://gist.github.com/Neo23x0/577926e34183b4cedd76aa33f6e4dfa3
https://gist.github.com/Neo23x0/81990b8e5eb351a118dca1d5f2a2a86b
https://gist.github.com/notareverser/7

I got 2 interesting answers:

Thanks AllenSwackhamer and bmmaloney97!

Still, I wanted something simpler. I just wanted to build a single, ‘megalopolis’ type of yara file that includes all the yara rules I have ever saved.

I am a hoarder, so anytime I come across some interesting code (f.ex. c, idapython, idc, etc.), signatures and rules (flirt, yara, capa, etc.), file formats, compression, exploitation bits, bobs and PoCs, info on new attacks, any info really, posted on social media or web sites, advertised via rss feeds, whatever, I just bookmark it or download it, and I don’t really spend much time categorizing, deduplicating and organizing it. Despite many attempts over the years to make it ‘easier on me’, I always end up having it stored all over the place. I really wish I was more Marie Kondo, but it’s a mess.

I literally have a pile of different yara rules collected over the last 8+ years, many of them written by me of course, all scattered across many folders, and with my aforementioned question on social media I simply wanted to achieve one thing: walk through all of these yara files, deduplicate, remove all the poor-quality rules (f.ex. many PEiD rules), remove all complex rules (f.ex. where one rule depends on another), and also remove any rules related to Android malware, because I really don’t have much interest in this topic.

Many of the approaches presented by the very mature projects and gists listed above focus on yara rules seen from a source-code perspective. That is, you can use existing libraries to parse these yara rules, maybe calculate hashes of their bodies, and do a lot of interesting things. But then again, I wanted something simpler.

So I devised a cunning plan aka THE YARA PAGEANT ALGORITHM (a rough Python sketch follows the list):

  • manual step: find all yara rules on my system and copy them all to one place; use subfolders where applicable, but don’t care about duplicates or multiple versions of the same github repo or file, just drop them all in one place… it is… all okay… etc.
  • now I have a single place where I have stored ALL the rules I have ever collected (this is mainly to improve performance – it’s easier to scan one directory with a script than to scan all drives)
  • now I can run a script that will comb through all these yara rule files, parse them all, and extract every single atomic yara rule into a separate file; that is, if a file stores one yara rule, only one output yara rule file will be saved, but if it is a collection, we export every single unique yara rule into its own separate file
  • while doing so, we remove all the metadata; that is, we leave only the strings and/or conditions; still, we should copy the body of the original rule into the final file, just commented out – this is not only for troubleshooting, but it also preserves crucial info about the rule, the info we will need if the rule ends up in the final ‘mega’ file and we want to understand where it came from and what other metadata is available for it
  • to ensure names do not collide, we can add a prefix in the form of f.ex. ‘h_<counter>_’ to every single rule name, where the counter increases for every single rule file written to disk
  • while processing them in bulk, we can exclude many rules that rely on external information: filename, extension, or known macros, f.ex. IS_PE; who cares… it’s a small percentage, the ROI is low, let’s just ditch them
  • to ensure it actually works, we add a universal prologue to every single output yara file created; the prologue consists of:
import "console"
import "elf"
import "hash"
import "math"
import "pe"
import "dotnet"

This ensures all the module dependencies are resolved (except for androguard, but I ignored it ‘by design’)

  • Note that it may help to use a RAM disk for this exercise!
  • once we go through every single source file, we end up with a gazillion single-rule yara files
  • we now use yarac to compile every single one of these yara files, yes, one by one
  • we ignore those that don’t compile – it’s a small percentage of all rules
  • with no metadata, many of them are just atomic detections that compile to a very specific binary form
  • once all of them are compiled, we remove all compiled files that are duplicates; that is, those where the binary output of yarac is identical to the yarac output produced for another rule
  • we now have a directory with a gazillion individual yara rules, plus, for those that are unique, we also have them compiled
  • for every single yara file that has a compiled version, we add it to the final ‘mega’ yara file
  • once the ‘mega’ file is completed, we run it through yarac to create a ‘mega’ compiled version of all unique rules
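
Here is that rough Python sketch of the pageant. It is NOT my actual (spaghetti) script; it assumes plyara is installed and yarac is on the PATH, and the directory names and most implementation details are illustrative only:

# Sketch of the pageant: explode all collected .yar files into single-rule files,
# compile each one with yarac, dedupe on the compiled output, then build the 'mega' file.
import hashlib
import subprocess
from pathlib import Path

import plyara
from plyara.utils import rebuild_yara_rule

PROLOGUE = '\n'.join(f'import "{m}"' for m in
                     ('console', 'elf', 'hash', 'math', 'pe', 'dotnet')) + '\n\n'

src_dir = Path('all_my_yara_files')   # the single place all collected rules were copied to
out_dir = Path('pageant')             # ideally a RAM disk
out_dir.mkdir(exist_ok=True)

parser = plyara.Plyara()
seen = set()                          # sha256 hashes of compiled rules, for dedup
mega_parts = []
counter = 0

for src in src_dir.rglob('*.yar'):
    try:
        parsed = parser.parse_string(src.read_text(errors='ignore'))
    except Exception:
        continue                                        # unparseable file, skip it
    parser.clear()
    for rule in parsed:
        counter += 1
        original = rebuild_yara_rule(rule)              # full rule, metadata included
        commented = ''.join(f'// {line}\n' for line in original.splitlines())
        rule.pop('metadata', None)                      # strip the meta section
        body = rebuild_yara_rule(rule)                  # strings/condition only
        single = out_dir / f'{counter}.yar'
        single.write_text(PROLOGUE + commented + body)
        compiled = single.with_suffix('.yarc')
        if subprocess.run(['yarac', str(single), str(compiled)],
                          capture_output=True).returncode != 0:
            continue                                    # does not compile on its own, ditch it
        digest = hashlib.sha256(compiled.read_bytes()).hexdigest()
        if digest in seen:
            continue                                    # identical compiled output = duplicate
        seen.add(digest)
        # apply the h_<counter>_ prefix only now, so that true duplicates still produce
        # identical compiled output above even though their counters differ
        renamed = body.replace(rule['rule_name'], f"h_{counter}_{rule['rule_name']}", 1)
        mega_parts.append(commented + renamed)

Path('mega.yar').write_text(PROLOGUE + '\n'.join(mega_parts))
subprocess.run(['yarac', 'mega.yar', 'mega.yarc'], check=True)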

That’s it.

Going forward, we simply run ‘yara -w -C <compiled mega yara file> <malware file>’ to have all these rules applied to the target file. If you have many yara rules in your ‘mega’ pack, you may see rules hitting on file properties and features, and if you are lucky, a rule for a specific TA or malware family may sometimes hit too. It helps to use the ‘-s’ argument to see the exact strings extracted from the sample that hit the rule, so you can quickly tweak the ‘mega’ source file, recompile, and avoid FPs in subsequent runs.
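
If you prefer to do the same from a script, a minimal sketch using the yara-python module could look like the snippet below (the file names are placeholders, and the exact format of the matched-strings objects varies between yara-python versions):

# Apply the compiled 'mega' pack to a sample via yara-python (sketch only).
import yara

rules = yara.load('mega.yarc')            # load the pre-compiled rule pack
for match in rules.match('sample.bin'):
    print(match.rule)                     # rule name (with its h_<counter>_ prefix)
    for s in match.strings:               # matched string details, like 'yara -s'
        print('   ', s)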

I wish I could share the source code of my script and the commands doing all the stuff I described above, or even my own mega yara pack. But I can’t. It’s spaghetti code, some of the rules are super private, and in the end, your needs may be different from mine. Still, nothing can stop you from starting your own Yara pageant today…