Excelling at Excel, Part 4

Excel is the emperor of automation. Not the SOAR type, but the local one – yours.

Why?

Its formulas and VBA capabilities can turn many awfully mundane tasks into plenty of automation opportunities…

For instance… certain programming tasks.

The case/switch syntax is a beautiful construct. It allows us to define a large set of complex if/then statements in a very elegant way. It is very often used to split a data/value set into conditions that determine the result/output/outcome based on the input.

Now, there are some programming languages that do not support case/switch statements well (not a part of its functional specification, introduced in late versions hence not fully compatible, etc.).

Writing case/switch statements for these languages is TOUGH. This is because their limitations often force us to rely on a bunch of NESTED IF clauses… Writing many of these, nested, is not for the faint-hearted. Typos, incorrect number of opening or closing parenthesis, and overall — getting lost in the complexity of this nested logic is very easy.

And this is where Excel can help a lot.

Imagine a hypothetical scenario where you have to write a code that gets this input data (names that are a loose set of Value1, Value2, Value3) to generate the output (A, B, C, D, E) based on peoples’ names:

It’s pretty straightforward in case/switch supporting programming languages to define rules based on this set, but if the only thing available are nested ifs – it’s much harder.

Let’s try though…

The first idea we can tackle is to convert these 3 values (some are empty) to an array of per-row list of values:

You will admit that the List column looks more ‘programmer friendly’ now.

The formula that produces the values in the List column is obscenely simple:

We simply build a parenthesis-embraced list of values where the first one is always present, and the two others are added only if the respective cells are not empty.

Trivial!

But how do we convert this list of values into a programmatic construct that gives us A, B, C, D or E depending on the name (one of these three values/per row)?

This is one way to do it (using ternary operator):

(if (name in ('John', 'Jack', 'Anne') ? 'A' : (if (name in ('Peter') ? 'B' : (if (name in ('Kate', 'Leo') ? 'C' : (if (name in ('Paul', 'Ariel', 'Fyodor') ? 'D' : (if (name in ('Amy', 'Maria') ? 'E' : "N/A"))))))))))

If you are curious where the formula comes from — it is from the very same spreadsheet – it is the value of the F2 cell:

And how did we build this one?

This is how:

What we do here is building a basic ternary logic where we take the one part of the comparison from the current row, and then the alternative is taken from the row below. In the end, the sort-of recursion happens and we end up with a sequence of nested ternary operators doing its work.

It may be a bit surprising, but using Excel for building complex logical statements like the one above, the one that in the end can be pasted directly to your favorite programming editor is actually very easy…

You benefit from the fact the input data is saved in Excel format and is easy to edit, plus – as long as the formulas are correct – the resulting nested constructs are written (generated) in a syntactically correct way and with far less chances to introduce a basic typo error.

Excelling at Excel, Part 3

One of the most common use cases we come across during our malware analysis exercises is a ROI-driven comparison of features between many samples of the same malware family. Yes, we can use BinDiff, Diaphora (and we should), but if it is a time-sensitive research, we need to take some shortcuts to deliver early results pronto.

Here’s one way to do it.

Note: I have used this approach many times in the past, because it’s simple, easy to understand, and produces a visual that is VERY nice to include in your deliverables. Your customers do not want to see the lengthy reports including all gore metadata and strings extracted from each and every sample. They want to see THE STORY.

A high-level matrix showing a difference between different versions of malware can be often quickly built by looking at, and comparing strings extracted from multiple samples (I simplify it a bit here: yes, in many cases you may need to decrypt these strings first, but it’s a reversing task we are all used to, so I won’t be covering it here).

So, how do we go about it?

After the basic triage, we extract strings selectively from all the samples of malware we have and we put them all in Excel. Like this:

We now have a simple database of ‘features’ or unique capabilities of each sample. There are similarities between them that are immediately striking, plus, some features seem to exist across many samples, and some are just one-offs. As a side note: all these one-offs are VERY interesting as they often are examples of bad OPSEC, and/or may reveal some of the additional intentions/motivations of the attackers. In some rare cases, the older compilers embed so-called ‘dead code’ (never executed) in the final executable, so strings extracted from such sample provide a rare intel that may help with attribution. True story: over a decade ago I have analyzed one such sample and the dead code gave away a lot of info about the attacker – that was enough to pinpoint the exact individual responsible for that particular hacking spree.

We now want to build a superset list of all features, and for each sample, put ‘yes’ if the feature is present, and ‘no’ – if it is not.

So, we first add a new column to which we will copy and paste all the filled-in cells from each sample’s column, one by one:

We then remove all duplicates from the ‘All’ column by using Data / Remove Duplicates function:

giving us this as a result:

Let me now show you what we want to build from this data: the final product we are after is a matrix of all interesting strings / features (column ‘All’) cross-referenced with each sample’s strings:

How do we build it?

We use three functions: VLOOKUP, ISNA and IF.

VLOOKUP helps us to find a given string in a specific column. If it exists, it will simply give us its value, otherwise it will return “N/A” (not available). We then use ISNA function to test the result of VLOOKUP. If it is “N/A”, we output “No”, otherwise we output “Yes”. We then add some basic cell formatting on top of it (add borders to make it look like a table, center text vertically and horizontally within the cells) + add some conditional formatting (if cell contains “Yes”, make it Blue, and if “No”, make it Red) and we get the result shown above.

These are formulas I have used in this example:

It may look complicated at first, but the logic is brutally simple:

We start by using a VLOOKUP function — we look for a string from a fixed column $E ‘All’ (dollar in front of it is to avoid it changing to F, G, H in subsequent horizontal formula copy via CTRL+R mentioned below), within the A:A column (which is sample1), and if it is found (ISNA returns false) we say ‘Yes’, otherwise, we say ‘No’. We populate this formula with CTRL+D (vertically), and CTRL+R (horizontally) and all the other cells should be now filled in with formulas like on the above picture.

When we build a matrix like this we can immediately spot some interesting bits about the samples:

  • sample4 may be the earliest as it includes `written by bored Bob`, possible OPSEC fail and attribution bit here
  • sample1 and sample3 seem to be the most advanced, with sample3 being probably the latest as it offers remoteshell capabilities; the ‘screencapture’ feature present in sample1 is not present in sample3 which could be explained as ‘it’s 2023 and video streaming is a thing aka screenshots are so 2010’

Of course, there are million other ways to do the very same task. And after all, it’s a manual and kinda mundane task to do this via Excel, but again, there are some lessons learned here:

  • by experimenting like this we build processes that can be then automated with better tools (f.ex. python, perl)
  • data presented in tables speaks to customers better than the most comprehensive reverse engineering efforts (none of them want to look at Ida or Ghidra screenshots; they want root-cause analysis, TTPs, IOCs, high-level description of features and capabilities, and what has changed between malware versions found on their systems)
  • in many cases I encountered, especially in a Linux world (ELF files), this is more than enough to pinpoint main differences between samples of the same malware family; it’s a great time saver!
  • even more interesting is another bit — for many trojanized programs or libraries (again, especially in Linux world), string-based comparisons against their clean versions often yield really great results (somehow, threat actors love to add a lot of extra debugging strings, messages, etc. that immediately stand out)