Excelling at Excel, Part 3

One of the most common use cases we come across during our malware analysis exercises is an ROI-driven comparison of features between many samples of the same malware family. Yes, we can use BinDiff or Diaphora (and we should), but if the research is time-sensitive, we need to take some shortcuts to deliver early results pronto.

Here’s one way to do it.

Note: I have used this approach many times in the past, because it’s simple, easy to understand, and produces a visual that is VERY nice to include in your deliverables. Your customers do not want to see lengthy reports including all the gory metadata and strings extracted from each and every sample. They want to see THE STORY.

A high-level matrix showing the differences between versions of malware can often be built quickly by looking at and comparing strings extracted from multiple samples (I am simplifying a bit here: yes, in many cases you may need to decrypt these strings first, but that is a reversing task we are all used to, so I won’t be covering it here).

So, how do we go about it?

After the basic triage, we extract strings selectively from all the malware samples we have and put them all in Excel. Like this:

We now have a simple database of ‘features’ or unique capabilities of each sample. There are similarities between them that are immediately striking; plus, some features seem to exist across many samples, and some are just one-offs. As a side note: all these one-offs are VERY interesting, as they are often examples of bad OPSEC and/or may reveal some of the additional intentions/motivations of the attackers. In some rare cases, older compilers embed so-called ‘dead code’ (never executed) in the final executable, so strings extracted from such a sample provide rare intel that may help with attribution. True story: over a decade ago I analyzed one such sample, and the dead code gave away a lot of info about the attacker – enough to pinpoint the exact individual responsible for that particular hacking spree.

We now want to build a superset list of all features and, for each sample, put ‘yes’ if the feature is present and ‘no’ if it is not.

So, we first add a new column to which we will copy and paste all the filled-in cells from each sample’s column, one by one:

We then remove all duplicates from the ‘All’ column by using the Data / Remove Duplicates function:

giving us this as a result:

Let me now show you what we want to build from this data: the final product we are after is a matrix of all interesting strings / features (column ‘All’) cross-referenced with each sample’s strings:

How do we build it?

We use three functions: VLOOKUP, ISNA and IF.

VLOOKUP helps us find a given string in a specific column. If the string exists, VLOOKUP simply gives us its value; otherwise it returns the #N/A (not available) error. We then use the ISNA function to test the result of VLOOKUP: if it is #N/A, we output “No”, otherwise we output “Yes”. We then add some basic cell formatting on top of it (borders to make it look like a table, text centered vertically and horizontally within the cells) plus some conditional formatting (if a cell contains “Yes”, make it Blue; if “No”, make it Red) and we get the result shown above.

These are the formulas I have used in this example:

It may look complicated at first, but the logic is brutally simple:

We start with the VLOOKUP function — we look for a string from the fixed column $E (‘All’; the dollar sign in front of it stops it from changing to F, G, H when the formula is copied horizontally with CTRL+R), within the A:A column (which is sample1). If it is found (ISNA returns false) we say ‘Yes’; otherwise we say ‘No’. We populate this formula with CTRL+D (vertically) and CTRL+R (horizontally), and all the other cells should now be filled in with formulas like those in the picture above.
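For reference, a formula matching this description could look like the one below (a sketch that assumes the deduplicated ‘All’ list sits in column E, starting at E2, and sample1’s strings sit in column A; your exact cell references may differ):

=IF(ISNA(VLOOKUP($E2,A:A,1,FALSE)),"No","Yes")

When copied to the right with CTRL+R, A:A shifts to B:B (sample2), C:C (sample3) and so on, while $E stays locked on the ‘All’ column; when copied down with CTRL+D, $E2 becomes $E3, $E4, etc.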

When we build a matrix like this we can immediately spot some interesting bits about the samples:

  • sample4 may be the earliest, as it includes `written by bored Bob` – a possible OPSEC fail and attribution hint
  • sample1 and sample3 seem to be the most advanced, with sample3 probably being the latest as it offers remoteshell capabilities; the ‘screencapture’ feature present in sample1 is not present in sample3, which could be explained as ‘it’s 2023 and video streaming is a thing, aka screenshots are so 2010’

Of course, there are a million other ways to do the very same task. And yes, doing it via Excel is a manual and kinda mundane job, but there are some lessons to be learned here:

  • by experimenting like this we build processes that can then be automated with better tools (f.ex. python, perl)
  • data presented in tables speaks to customers better than the most comprehensive reverse engineering efforts (none of them want to look at IDA or Ghidra screenshots; they want root-cause analysis, TTPs, IOCs, a high-level description of features and capabilities, and what has changed between malware versions found on their systems)
  • in many cases I encountered, especially in the Linux world (ELF files), this is more than enough to pinpoint the main differences between samples of the same malware family; it’s a great time saver!
  • even more interesting is another bit — for many trojanized programs or libraries (again, especially in the Linux world), string-based comparisons against their clean versions often yield really great results (somehow, threat actors love to add a lot of extra debugging strings, messages, etc. that immediately stand out)

Excelling at Excel, Part 2

Today I will talk about automated query-building using Excel.

Working as detection engineering and/or threat hunting specialists, we often need to create a lot of queries with a lot of repetitive conditions that follow a very similar syntax. It’s not pretty. It’s not easy to manage. We can do better.

For instance: our logs refer to process image names (the names of the files used to launch processes), we want to write a query that focuses on a bunch of well-known browser processes, and the list of browser file names we have looks like the one below:

  • chrome.exe
  • firefox.exe
  • brave.exe
  • iexplore.exe
  • microsoftedge.exe
  • microsoftedgecp.exe
  • opera.exe
  • qqbrowser.exe
  • 360se.exe
  • 360chrome.exe
  • sogouexplorer.exe
  • maxthon.exe
  • vivaldi.exe
  • epic.exe
  • safari.exe

it may be tempting to write all of these query conditions manually. That is, for every process name listed, you would type something like:

image_file_name="*\\chrome.exe" OR
image_file_name="*\\firefox.exe" OR
etc.

With Excel, it is an easy task to create a formula that will generate this code for us automagically:

You type the formula in B1, then populate the next rows with CTRL+D.
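In case the screenshot is hard to read, a formula along these lines produces that output (a sketch assuming the process names live in column A, starting at A1):

="image_file_name=""*\\" & A1 & """ OR"

The doubled quotes are how Excel escapes a literal double quote inside a text string.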

The result (flip between formulas and values using CTRL+`) looks like this:

There are many benefits of using Excel to simplify this mundane task – the primary one being that it helps us separate the input data (process names of interest) from the actual code/query. It also forces us to keep the list of all browser-related process names in column A, while the auto-generated code automagically populates column B. That is, any time we add a new process name, we can autogenerate the code for another condition by simply extending the formula in column B to the next row!

If you work on automation a lot, you may criticize this approach: why do we even need Excel if we could simply keep a dedicated text file listing all the process names of interest (browsers, or whatever), and then have it ingested by a scheduled task that builds the new query on the fly using f.ex. a python script that combines a query template with a dynamically built list of conditions? You are spot on. If you are that far ahead, please carry on. Still, if we want to keep everyone happy, why don’t we keep that list in a shared Excel file, so it’s easy to add entries using a desktop application or a web-based GUI instead of vim or emacs 🙂 — the easier it is for _anyone_ to add to this list, the more engagement and participation you will see from your team members in enriching it. And extracting data from Excel using python, perl, or whatever is super easy. So, data input for automation can be done in Excel, while all the gory automation bits can be done in any language we like. Win-win.

Coming back to our Excel exercise… if you are a good observer, you have probably noticed that the last row still includes the ‘ OR’ at the end, which will obviously cause a syntax error in a larger query (f.ex. in Splunk’s SPL). You can simply delete the last ‘ OR’ manually after copying the conditions to the Splunk query window, or improve the formula so that it adds ‘ OR’ at the end of the line only if the next row includes a process name in column A, f.ex.:
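In text form, such a formula could look like this (again, a sketch assuming the process names sit in column A); it appends ‘ OR’ only when the cell below is not empty:

="image_file_name=""*\\" & A1 & """" & IF(A2<>""," OR","")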

resulting in:

We can control the output of such automagically generated code in any way we want. It’s just a matter of writing a proper (sometimes complex) formula.

Apart from all the benefits outlined above, there is one more — by generating queries in an auto-magical way we avoid a lot of data entry errors. Not only are we avoiding typos and incorrectly escaped characters, we are also avoiding costly business logic errors which sometimes occur when we add one too many, or simply miss one, ‘ OR’ or ‘ AND’ in long queries — these mistakes may change the logic of the query in a very subtle way, leading to many hours of troubleshooting.

The other area where Excel and queries created ‘automagically’ from input data may be useful is a database of regular expressions. When we threat hunt, we use regular expressions a lot. The most naive and traditional approach to using them relies on copypasta (sometimes done wrong!) where the inputs are old/existing queries, often not even ours, and we can of course do better than that.

What if we designate a single Excel sheet that stores them all in a plain-vanilla (but peer-reviewed) form? We can then cherry-pick them from the list as we wish, knowing they are solid, reliable, and tested, and we can also make sure the sheet includes sample queries relying on these regular expressions, f.ex.:

In the interest of full disclosure — I created this just for this article, so there may be some typos, but if you look at it, you probably get the idea. One place for all regexes -> less chance of reinventing the wheel any time we come across the same data parsing/extraction problem in the future.

There are two classes of regular expressions listed on the screenshot:

  • the top ones are used for tokenization, which relies on replacing a part of the data input with a token — very useful when we want to normalize some data, f.ex. paths or process command-line arguments (f.ex. in Least-Frequency Analysis)
  • the bottom ones are extraction regular expressions — they take the input and extract a part of it, giving us more context or a more meaningful way to analyze the data
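To make the two classes a bit more concrete, here is a made-up pair (not taken from the screenshot): a tokenization regex that swaps the user name in a Windows path for a token, and an extraction regex that pulls the file name out of a full path:

C:\\Users\\[^\\]+        (tokenization: replace the match with C:\Users\<user>)
^.*\\(?<filename>[^\\]+)$        (extraction: capture the file name into the ‘filename’ group)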

If you are curious what the entries shown above look like in ‘formulas’ mode (CTRL+`), have a look at this:
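As a purely hypothetical illustration (not the exact content of that screenshot), a sample-query cell could be built from a regex cell with a formula like this, assuming the extraction regex sits in B2 and we are generating a Splunk rex snippet against a made-up field name:

="| rex field=process_command_line """ & B2 & """"

This way, updating the regex in one cell automatically updates every sample query that references it (as long as the regex itself does not contain double quotes that would need extra escaping).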

Again, you may have an immediate knee-jerk reaction telling you that it’s silly to do it this way and that there are better, more efficient ways. And again I must say — you are way ahead — and this article is probably not for you 🙂

Using Excel right helps us build processes at first, and then build better processes. These can then be automated, and they really should be.