Clustering and Batch Analysis of APT1 sampleset, part 2

Part 1, Part 2, Part 3

In my last post, I presented the results of batch analysis and clustering attempts on the APT1 sampleset. Today, I will continue on the topic of clustering, this time looking at the problem from a different angle. Again, the results are not mind-blowing, but it’s an experiment, and it’s not about the destination, but about the journey 😉

Typical programs, even if recompiled or rebuilt with a different configuration, often preserve their internal structure. This is partly a result of the ‘if it works, don’t touch it’ approach, the programmer’s laziness and coding habits, side effects of the ‘business logic’ implemented in the program, and many other factors. One can compare a patched file with the original and immediately spot the changes, or pick a characteristic sequence of code or data and search for similar sequences across the whole sampleset. The fact that programs can be compared at the binary level is well known, and pretty much every 0day hunter has at some stage used, or is currently using, this technique in their bug-hunting adventures (with tools like BinNavi).

In the context of sample clustering, I think we don’t necessarily need to go as far as in-depth binary code comparison; there are a lot of shortcuts we can take here. The easiest is to pick those code sequences that refer to strings. And to narrow down the scope for this post, we will only look at string comparisons. They are used for parsing command line arguments, RAT/bot commands, data sent over various protocols, and so on.
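To make the idea a bit more concrete, here is a minimal sketch of how samples could be grouped once we know which strings each of them compares against. It is only an illustration, not the tooling used for this post: the function name `group_by_shared_strings`, the sample names and the strings are all hypothetical placeholders, and the grouping is a plain inverted index (string → samples), which is just one of many possible approaches.

```python
from collections import defaultdict


def group_by_shared_strings(sample_strings, min_samples=2):
    """Map each compared string to the sorted list of samples that use it.

    sample_strings: dict of sample name -> iterable of strings that the
    sample compares its input against (hypothetical input format).
    Strings seen in fewer than min_samples samples are dropped.
    """
    index = defaultdict(set)
    for sample, strings in sample_strings.items():
        for s in strings:
            index[s].add(sample)
    return {
        s: sorted(names)
        for s, names in index.items()
        if len(names) >= min_samples
    }


# Hypothetical data: invented names and strings, not taken from the APT1 set.
samples = {
    "sample_a": {"download", "sleep"},
    "sample_b": {"download", "exit"},
    "sample_c": {"upload"},
}

print(group_by_shared_strings(samples))
# {'download': ['sample_a', 'sample_b']}
```

Samples that share an unusual command string end up in the same bucket, which is a cheap first hint that they may be related.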

There are a few ways compilers implement string comparisons at the machine code level. Based on some quick research and the clustering already done on the APT1 sampleset, I know that plenty of string comparisons are done using just 5 functions: _strnicmp, memcmp, strcmp, strncmp, strstr.
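As a rough illustration of what “looking at string comparisons” can mean in practice, the sketch below scans C-like text (for example, decompiler output saved to a file) for calls to these five functions and collects the string literals passed to them. This is an assumption about the input format, not the exact method used to produce the results here; the helper name `compared_strings` and the sample snippet (including its variable names and strings) are made up for the example. It is deliberately simple: calls whose arguments contain nested function calls are skipped.

```python
import re

# The five comparison functions named above.
COMPARE_FUNCS = ("_strnicmp", "memcmp", "strcmp", "strncmp", "strstr")

_FUNCS = "|".join(re.escape(f) for f in COMPARE_FUNCS)
# A double-quoted C string literal, allowing backslash escapes.
_STR = r'"(?:[^"\\]|\\.)*"'

# A call to one of the functions. The argument list may contain string
# literals (which may themselves contain parentheses) but no nested calls.
CALL_RE = re.compile(
    r'(?<![A-Za-z0-9_])(' + _FUNCS + r')\s*\(((?:' + _STR + r'|[^()"])*)\)'
)
STRING_RE = re.compile(r'"((?:[^"\\]|\\.)*)"')


def compared_strings(pseudocode):
    """Return a sorted list of (function, literal) pairs found in the text."""
    found = set()
    for match in CALL_RE.finditer(pseudocode):
        func, args = match.group(1), match.group(2)
        for literal in STRING_RE.findall(args):
            found.add((func, literal))
    return sorted(found)


# Made-up snippet in the style of decompiler output.
snippet = '''
if ( !strcmp(v3, "download") )
  return 1;
if ( !_strnicmp(a1, "http://", 7u) )
  return 2;
if ( strstr(buf, "cmd(") )
  return 3;
'''

for func, literal in compared_strings(snippet):
    print(func, repr(literal))
```

Keeping the function name next to each literal preserves a little of the context that a plain strings dump throws away.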

To look at code sequences of this sort, we need a better way of extracting strings from the samples, because a typical static analysis tool extracts strings from the file in a ‘dumb’ way and doesn’t provide the necessary programming/algorithmic context. A tool like PESectionExtractor doesn’t help either, as the context it provides relates only to the physical location of the string and doesn’t tell us how the string is used by the program. What we need is a tool that can disassemble the code properly or, even better, decompile it. And the obvious choice here is IDA Pro with its Hex-Rays plugin.

The resulting files are a bit too large to copy and paste directly into this post, so I am providing direct links to the text files below: