Not installing the installers, part 2

In the last post I described how we can pull some interesting metadata from decompiled installers. Today I want to discuss one practical example of how this data can enrich analysis, both manual and automatic (f.ex. sandbox, EDR).

Many programs cannot be properly analyzed by sandboxes, because they require command line arguments. While command line options for native Windows OS binaries are usually well documented (well, not really, there is a lot of undocumented stuff, but let’s forget about it for a second), command line options used by goodware are a completely different story. And of course, it is even worse for malware.

The good news about goodware is that it handles command line arguments in a very predictable way. The string comparisons are usually ‘naive’, direct and not optimized, and the programs often include actual help text that can be seen after running them with the /?, /help, -h or --help arguments. Very often, a search for the ‘usage’ keyword inside the binary can help us find the options the program recognizes; f.ex. such a usage string can be found inside cscript.exe.

Predictable is good, and can serve at least a few purposes:

  • we can generate a list of known parameter strings that goodware typically uses (and even attribute it to specific software vendors)
  • we can create yara signatures for these
  • we can incorporate this set into EDR command line parsing routines to assess an invocation’s similarity to known good
  • we can also leverage this to run the sample in a sandbox or manually with these easily discovered command line arguments (kinda like assisted fuzzing)
  • we can include these findings in the report itself to hint to analysts that they may need to do some manual reversing (it would seem the program accepts the following command line arguments…)
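To make the EDR idea from the list above a bit more concrete, here is a rough sketch of scoring an invocation against a set of known-good switches. The function name, the regex and the sample switch set are all made up for illustration; a real EDR would need per-program lists and far more careful tokenization.

```python
import re

# Assumption: a switch is "/", "-" or "--" followed by a letter and more
# word characters, and it must follow whitespace so that slashes inside
# paths are less likely to match.
SWITCH_RE = re.compile(r'(?<=\s)(?:/|--?)[A-Za-z][A-Za-z0-9_]*')

def known_switch_ratio(command_line, known_switches):
    """Return the fraction of switches on the command line that appear in
    known_switches (lowercase strings), or None if no switch is present."""
    switches = [s.lower() for s in SWITCH_RE.findall(command_line)]
    if not switches:
        return None
    known = sum(1 for s in switches if s in known_switches)
    return known / len(switches)

# Hypothetical known-good set for one program
KNOWN = {'/i', '/qn', '/norestart'}
print(known_switch_ratio('msiexec /i foo.msi /qn', KNOWN))
print(known_switch_ratio('msiexec /i foo.msi /zz', KNOWN))
```

A low ratio does not mean the invocation is bad, only that it uses switches the list has never seen for that program.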

Looking for typical command line arguments is actually quite difficult. There are a lot of ways to implement string comparisons and, as I explained a long time ago in one of my sandbox series, there are like a gazillion different string functions out there. Plus, different compilers and different optimizations make the code even harder to comprehend. A naive search for /[a-zA-Z_0-9]+/ could work on a binary level, but it is going to hit a lot of FPs. Decompiled scripts can come to the rescue, as they include actual invocations of programs and specify precisely what parameters will be passed to the program.

The attached list focuses on a basic command line argument extraction (just the /foo part) from around 10K decompiled scripts. More advanced analysis would include options taking parameters (f.ex. /foo bar).
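A minimal sketch of that basic extraction could look like the one below. It assumes the decompiled scripts have already been boiled down to one command line per string; the function name and the regex are mine, not what was used to build the attached list, and the "-" / "--" prefixes are an extra assumption on top of the plain /foo form.

```python
import re
from collections import Counter

# Assumption: a switch is "/", "-" or "--" followed by a letter and more
# word characters. The lookbehind requires whitespace in front of it, so
# a slash inside "C:\dir\x/y.dll" does not count as a switch.
SWITCH_RE = re.compile(r'(?<=\s)(?:/|--?)[A-Za-z][A-Za-z0-9_]*')

def extract_switches(command_lines):
    """Count switches (lowercased) across an iterable of command lines.
    Only the switch itself is kept; any value after it (/foo bar, /D=...)
    is ignored."""
    counts = Counter()
    for line in command_lines:
        for switch in SWITCH_RE.findall(line):
            counts[switch.lower()] += 1
    return counts

# Hypothetical invocations, as they might appear in a decompiled script
sample = [
    r'setup.exe /S /D=C:\App',
    r'msiexec /i "foo.msi" /qn /norestart',
    r'regsvr32 /s "C:\Program Files\x/y.dll"',
]
for switch, n in extract_switches(sample).most_common():
    print(n, switch)
```

This will still produce false positives (f.ex. a Unix-style path passed as an argument), so the output is a starting point for review rather than a clean list.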

You can download it from here.

Hijacking HijackThis

Long before endpoint event logging became the norm, it was incredibly difficult to collect information about popular processes, services, paths, CLSIDs, etc. Antivirus companies, and later sandbox companies, had tons of such metadata, but an average Joe could only dream about it.

This is where HijackThis came into play. At a certain point in history, lots of people were using it and posting its logs on forums for hobbyist malware analysts to review. And since a HijackThis log has a very specific ‘look and feel’, it was pretty easy to parse it. And find it.

In order to collect as many logs as possible, I wrote a simple crawler that would google around for very specific keywords, collect the results, then visit the pages, download them to a file, and parse the result. Each session would end up with a file like this:

[Processes - Full Path names]
[Processes - Names]
[Directories]
[All URLs]
[Registry - Full Path names]
[Registry - Names]
[Registry - Values]
[BAD URLs]
[CLSIDs]
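For anyone who wants to load such a session file, here is a small sketch of a reader. Only the section headers above come from the real files; the assumption that each header is followed by one entry per line is mine, as are the function name and the sample data.

```python
# Section headers exactly as listed above
KNOWN_SECTIONS = {
    'Processes - Full Path names', 'Processes - Names', 'Directories',
    'All URLs', 'Registry - Full Path names', 'Registry - Names',
    'Registry - Values', 'BAD URLs', 'CLSIDs',
}

def parse_session(text):
    """Split a session file into {section name: [entries]}.
    Assumption: one entry per line under each [Section] header."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Treat a line as a header only if it names a known section, so
        # that entries which happen to be bracketed are kept as data.
        if line.startswith('[') and line.endswith(']') and line[1:-1] in KNOWN_SECTIONS:
            current = line[1:-1]
            sections.setdefault(current, [])
            continue
        if current is not None:
            sections[current].append(line)
    return sections

# Hypothetical content
sample = """[Processes - Names]
explorer.exe
svchost.exe
[CLSIDs]
{AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE}
"""
print(parse_session(sample))
```

Lines that appear before the first header are dropped, which seemed the safest default for a data set with known parsing errors.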

There are plenty of uses for the collected data. One of the handy ones back then was a comprehensive list of CLSIDs: knowing these, you could incorporate them into a simple binary/string signature and search for them inside analyzed samples. If a specific CLSID was found, it was quite easy to ID the sample's association or, at least, some of its features. Another interesting list of artifacts is rundll32.exe invocations. There are many legitimate ones and it’s nice to be able to query them all and put them together on a ‘clean’ list. Of course, URLs are always a good source for downloads, while directories and paths, as well as registry entries and process/service lists, are handy for generating statistics on which paths are normal and which are not: a list of ‘known clean’ that could be a foundation for a more advanced version of Least Frequency Occurrence (LFO) analysis. Even browsing file paths is an interesting exercise; for example, it allowed me to collect information about many possible file names of interest (f.ex. those that could be used in anti-* tricks).
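The CLSID signature idea can be sketched roughly as follows. A CLSID may sit in a sample as an ASCII string, as a UTF-16LE string, or as the raw 16-byte GUID structure, so the sketch checks all three. The function names and the CLSID/label pair are placeholders, not entries from the actual data set.

```python
import uuid

def clsid_patterns(clsid):
    """Byte patterns under which a CLSID may appear inside a binary."""
    u = uuid.UUID(clsid)      # accepts the {…} form as well
    text = str(u)             # lowercase, with hyphens, without braces
    patterns = []
    for variant in (text, text.upper()):
        patterns.append(variant.encode('ascii'))
        patterns.append(variant.encode('utf-16-le'))
    # bytes_le matches the in-memory layout of a Windows GUID structure
    patterns.append(u.bytes_le)
    return patterns

def find_known_clsids(data, known):
    """Return the subset of known ({clsid: label}) found in data (bytes)."""
    return {
        clsid: label
        for clsid, label in known.items()
        if any(p in data for p in clsid_patterns(clsid))
    }

# Placeholder CLSID and label, for illustration only
KNOWN = {'{AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE}': 'hypothetical toolbar'}
blob = b'MZ...' + uuid.UUID(int=0).bytes + b'CLSID\\{AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE}'
print(find_known_clsids(blob, KNOWN))
```

Mixed-case string forms are not covered here; a case-insensitive regex would handle them at the cost of speed.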

I had a lot of ideas around that time on incorporating this research into my forensic analysis workflow. For instance, if we know certain paths are very prevalent, it kinda makes sense to exclude them from analysis. Same goes for other artifacts. A twin idea from around that time was filelighting: it’s common for files in a software directory to be referenced in at least one of the other files. That is, if I find a file foo.bar inside a program directory, there is a high possibility that at least one of the other files, be it executables or configuration files, will reference that foo.bar file! It actually works quite well. The main deliverable of this idea was that if we can find orphaned files, they are suspicious. And, from a different angle, if we know which clusters belong to which software package, we could use that tree of self-referencing file names to eliminate them from analysis.
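A naive take on filelighting, as described above, might look like this. It is only a sketch under a few assumptions of mine: the function name is made up, file names are searched as UTF-8 and UTF-16LE byte strings with ASCII-only case folding, and every file is read fully into memory, so it is meant for a single program directory rather than a whole disk.

```python
from pathlib import Path

def find_orphans(root):
    """Return files under root whose name is not mentioned inside any
    other file under root."""
    files = [p for p in Path(root).rglob('*') if p.is_file()]
    # bytes.lower() folds ASCII letters only; the same folding is applied
    # to the needles below, so the comparison stays consistent.
    contents = {p: p.read_bytes().lower() for p in files}
    orphans = []
    for p in files:
        needles = (
            p.name.encode('utf-8').lower(),
            p.name.encode('utf-16-le').lower(),
        )
        referenced = any(
            needle in blob
            for other, blob in contents.items() if other != p
            for needle in needles
        )
        if not referenced:
            orphans.append(p)
    return sorted(orphans)

# Usage (the path is a placeholder):
# for p in find_orphans(r'C:\Program Files\SomeApp'):
#     print('orphan:', p)
```

The main executable of a package often shows up as an orphan too, since nothing inside the directory needs to name it, so the result needs a second look before calling anything suspicious.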

Times have changed, of course, and while these ideas may still have some value, the reality is that we live in a completely different world today.

In the end, I cannot say the database helped me a lot, but it was an interesting exercise, and since the data is quite obsolete by now I decided to drop its content online. It’s not a very clean data set, mind you. You will find errors in parsing: some HJT logs were truncated, some contained non-English characters, etc. Still, maybe you will find some use for it. Good luck!

Download it here.