Good file… (What is it good for) Part 3

We have our sampleset. We have our metadata.

What’s next?

You can very quickly script searches that will look for specific files, or their properties. I mentioned section names, PDB paths, icons, but there is more.

In an older blog post I highlighted a copy&paste crypto code block present in a number of ‘good’ files I had looked at. The reason I recognized these samples is pretty simple: they used API calls that happen to be on my list of APIs of interest.
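
As a rough illustration of that kind of check, here is a minimal Python sketch that looks for API names in the raw bytes of a file. The watch list below is hypothetical (the real list is not given here), and a raw byte search is a crude stand-in for proper import table parsing: it misses dynamically resolved or packed imports.

```python
import re

# Hypothetical watch list, chosen only for illustration.
APIS_OF_INTEREST = [
    "CryptAcquireContextA",
    "CryptAcquireContextW",
    "CryptEncrypt",
    "CryptDecrypt",
    "CryptHashData",
]


def apis_referenced(data: bytes, apis=APIS_OF_INTEREST) -> list[str]:
    """Return the watched API names that occur as NUL-terminated ASCII strings."""
    hits = []
    for name in apis:
        # Import names are stored as NUL-terminated ASCII; the lookbehind and the
        # trailing NUL stop 'CryptDecrypt' from matching 'CryptDecryptMessage'.
        pattern = rb"(?<![A-Za-z0-9_])" + re.escape(name.encode("ascii")) + rb"\x00"
        if re.search(pattern, data):
            hits.append(name)
    return hits
```

Run over a whole sampleset, the list of hits per file becomes one more metadata column to search on.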


Once you have your sampleset processed and its preliminary metadata generated, you can look at the properties extracted from the files and re-run some additional metadata collection tasks (often narrowed down to specific file types), and/or … disassemble/decompile these files for some quick code-based post-processing & … quick wins. The above example is one of them.

What could these post-processing tasks be?

Running yara and capa is of course a ‘must-do’, and it’s trivial. But that’s not everything. You can (and should) run instrumented oleview or olewoo to extract additional OLE/COM info from embedded type libraries. These not only give us info about unique GUIDs, but often point us to proprietary COM interfaces and methods of interest that could be used to do some ‘funny’ stuff – think: file downloading, program execution, escalation of privileges, etc.

The next targets are drivers – they are always of interest, because ‘who runs code in kernel – owns the box’. @hFirefox (Twitter account doesn’t exist anymore as of today) created a number of POCs showing that legitimate, signed, yet vulnerable kernel drivers can be abused to deliver payloads of (user mode) choice. There are definitely more vulnerable drivers out there, and guess what… some basic kernel driver analysis can also be done statically.

If you look at a number of them, you will start recognizing a lot of patterns. It’s not uncommon for kernel mode driver developers to include the whole list of debug messages, which often help in understanding the internals of their creations. For instance, grepping for ‘IOCTL_’-prefixed strings inside kernel drivers will give us a lot of hints about what the driver does and how it operates. And yes, it will give us the names of many IOCTLs as well!
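
The grep itself can be sketched in a few lines of Python. This is a minimal example, not a full strings extractor: it assumes the names appear either as plain ASCII or as UTF-16LE text, and the function names and the ‘.sys’ filter are my own choices for illustration.

```python
import re
from pathlib import Path

# ASCII form: 'IOCTL_' followed by identifier characters.
ASCII_RE = re.compile(rb"IOCTL_[A-Za-z0-9_]{2,}")
# UTF-16LE form: the same text with a NUL byte after every character.
UTF16_RE = re.compile(rb"I\x00O\x00C\x00T\x00L\x00_\x00(?:[A-Za-z0-9_]\x00){2,}")


def find_ioctl_names(data: bytes) -> list[str]:
    """Return the sorted, de-duplicated IOCTL_* names found in a byte blob."""
    names = {m.group().decode("ascii") for m in ASCII_RE.finditer(data)}
    names |= {m.group().decode("utf-16-le") for m in UTF16_RE.finditer(data)}
    return sorted(names)


def scan_tree(root: Path) -> dict[str, list[str]]:
    """Map every *.sys file under root to the IOCTL_* names it contains."""
    results = {}
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() == ".sys":
            names = find_ioctl_names(path.read_bytes())
            if names:
                results[str(path)] = names
    return results
```

Calling scan_tree() on a directory of unpacked drivers gives a per-driver list of names that can go straight into the metadata set.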


We can bulk analyze these.


Yes, I will cover this soon 🙂

Good file… (What is it good for) Part 2

This series talks about ‘good’ files. That is, files (samples) produced by reputable vendors, often signed, and hopefully not compromised by stolen certificates, vulnerabilities, supply-chain attacks or bothered by other err… minor inconveniences :-).

Say you have amassed a bunch of ‘good’ files and declared your first ‘goodware’ collection ‘ready for processing’.

What do you do with it?

The easiest way to start processing this sampleset is by applying advanced file typing first. We want to know what files we collected, and mind you, I mean not only a distinction between your random media file (jpeg, gif) and PE/ELF/MACH-O, but also binary vs. scripting, 32- vs. 64-bit, EXE vs. DLL, user mode vs. kernel mode, standalone exe vs. .NET, signed vs. unsigned, standalone vs. installer, installer/packer/protector types: autoit, pyinstaller, perl2exe, legacy PE file protection layers (mpress, pecompact, themida, etc.), MSI, and the gazillion existing installer types that typically store the installation information as compressed/encrypted data appended behind the generic executable installer stub (Nullsoft, InnoSetup, Wise, etc.).

To perform this task I use a combination of DiE and my own spaghetti-code script that I have been improving over the last 17 years (sorry, not for sharing, it’s absolutely disgusting!).
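
Purely as an illustration of the PE part of such typing (this is not the script mentioned above, and nowhere near what DiE does), here is a minimal sketch that reads a few header fields directly. The function name and the dictionary keys are my own. Two caveats: the native subsystem is only a hint that a file may be a kernel-mode driver, and a present certificate table says nothing about whether the signature is valid.

```python
import struct

IMAGE_FILE_DLL = 0x2000
IMAGE_SUBSYSTEM_NATIVE = 1
DIR_SECURITY = 4      # certificate table (where an Authenticode blob lives)
DIR_CLR_HEADER = 14   # CLR runtime header, present in .NET assemblies


def classify_pe(data: bytes):
    """Return coarse PE properties as a dict, or None if data is not a PE image."""
    try:
        if data[:2] != b"MZ":
            return None
        (pe_off,) = struct.unpack_from("<I", data, 0x3C)
        if data[pe_off:pe_off + 4] != b"PE\0\0":
            return None
        # Characteristics is the last field of the 20-byte COFF file header.
        (characteristics,) = struct.unpack_from("<H", data, pe_off + 22)
        opt_off = pe_off + 24
        (magic,) = struct.unpack_from("<H", data, opt_off)
        if magic not in (0x10B, 0x20B):   # PE32 / PE32+
            return None
        # Subsystem sits at offset 68 of the optional header in both variants.
        (subsystem,) = struct.unpack_from("<H", data, opt_off + 68)
        dirs_off = opt_off + (96 if magic == 0x10B else 112)
        (num_dirs,) = struct.unpack_from("<I", data, dirs_off - 4)

        def has_dir(index: int) -> bool:
            if index >= num_dirs:
                return False
            address, size = struct.unpack_from("<II", data, dirs_off + 8 * index)
            return address != 0 and size != 0

        return {
            "bits": 64 if magic == 0x20B else 32,
            "dll": bool(characteristics & IMAGE_FILE_DLL),
            "native_subsystem": subsystem == IMAGE_SUBSYSTEM_NATIVE,
            "dotnet": has_dir(DIR_CLR_HEADER),
            "has_signature": has_dir(DIR_SECURITY),
        }
    except struct.error:
        return None   # truncated or malformed headers
```

Installer, packer and protector detection is a different game (signatures, overlays, section names) and is better left to dedicated tools.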

Once we know what we are dealing with we can try to unpack stuff.

Why unpack you may ask?

Because what you download from vendors’ sites are often installers that internally store many additional files, many of which are OF INTEREST. Yup, additional embedded installers, standalone EXE/DLL/OCX/SYS, redistributables, etc.

How to do it?

7-zip is a natural candidate, but we need to be careful. I suspect NSRL analysts use it extensively and they unpack everything that has ‘a binary pulse’ recursively until 7z returns an error. As a result, they get tons of executable file sections’ metadata sneaking into their final hash set, and… in the end no one wins.

The other natural candidate is Universal Extractor (and its updated versions, e.g. Universal Extractor 2). You can’t win this battle either, because it’s too complex and you lose control of what gets unpacked and what ends up in your final metadata set, same as with an overzealous use of 7z.

We should definitely use these tools though; we just need to apply some moderation.

For instance, we can stop 7-zip from extracting PE file internals by using its -stx command line switch:

7z x -stxPE foo.exe

With that, you could run this recursively on all files (including traversing unpacked directories) and get a decent list of internal files, but without unpacking PE files!
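
A recursive run like that could be sketched as below. Assumptions on my side: 7z is on the PATH, the standard -y (answer yes to prompts) and -o (output directory) switches are added to the command shown above, and the ‘.unpacked’ directory naming, the depth limit and the function names are arbitrary choices for illustration.

```python
import subprocess
from pathlib import Path


def build_7z_command(archive: Path, out_dir: Path) -> list[str]:
    """Build the 7z call: extract, but never treat a file as a PE 'archive'."""
    return ["7z", "x", "-stxPE", "-y", f"-o{out_dir}", str(archive)]


def unpack_recursively(sample: Path, work_dir: Path, max_depth: int = 5,
                       run=subprocess.run) -> list[Path]:
    """Unpack sample, then try every extracted file again, up to max_depth layers.

    Returns all files that were extracted along the way.
    """
    found = []
    queue = [(sample, work_dir / (sample.name + ".unpacked"), 0)]
    while queue:
        path, out_dir, depth = queue.pop()
        result = run(build_7z_command(path, out_dir), capture_output=True)
        if result.returncode != 0 or not out_dir.is_dir():
            continue  # not something 7z can open (or a plain PE we chose to skip)
        # List the children now, before any nested '.unpacked' directories appear.
        children = [p for p in out_dir.rglob("*") if p.is_file()]
        found.extend(children)
        if depth + 1 < max_depth:
            queue.extend(
                (c, c.with_name(c.name + ".unpacked"), depth + 1) for c in children
            )
    return found
```

The run parameter exists only so the loop can be exercised without 7z installed; in normal use you would leave it at its default.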

For Universal Extractor-based tools their best use is … analysis of their code. You will find info on both the syntax and the required toolkit that helps to unpack less common archive formats. It’s the best way to study their handling of particular file formats/installers and how to unpack them, so that we can then cherry-pick those that work for us. Again, we don’t want them all, as you will end up unpacking lots of useless information (e.g. .MSI files) and generating a lot of poor quality metadata. Be choosy.

For InnoSetup there is InnoUnp.

For Nullsoft, there is 7z version 15.05, which extracts Nullsoft installer files very neatly, including the [NSIS].nsi file that is a decently reproduced Nullsoft Installation Script!

For AutoIT executables you can use ClamAV as pointed out by @SmugYeti.

Yup, you have to analyze your advanced file typing results first and then … divide and conquer.


Imagine you have file-typed all the samples and you know how to unpack them. What’s next?

I think there are two ways to go about the next step – it starts with a script picking up a single file from your repository, and:

  • recursively unpacking it and its subsequent ‘descendant files’, advanced file-typing them, and at the end copying files of interest (PE, MSI, etc.) back to the repository
  • unpacking the first layer only, then copying files of interest back to the repository for further processing

I think both approaches have advantages, with the first one being probably the ‘smartest’ (i.e. do it once, well), and the second being better optimized for resource usage. Why? With source files often being 200MB in size or more (yes, there are plenty of such installers nowadays!), the whole ‘extracted files & directories’ tree may end up being a good few, or even a few dozen, GB of data! And that simply implies the necessity of using the slow HDD as ‘a working space’.

Side note #1: I am talking about small SOHO hardware investments here! Side note #2: I also don’t advocate using SSDs in your SOHO ‘sample processing’ setup, as they tend to fail after too many I/O operations. I have had too many issues in the past and don’t recommend them.

In the second approach, a RAM disk may be good enough most of the time and it’s definitely better, performance-wise.

Choose your poison wisely.

I actually use a hybrid approach – if the file is relatively small I try to unpack it fully on the RAM drive, and if it is an obvious ‘fat installer’ I push it to HDD. And I always try to do ‘a full-recursive’. Saves time and is kinda neat. Again, your personal choice matters here.
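
The routing decision of such a hybrid setup can be sketched as a small function. Everything concrete here is an assumption made for illustration: the two paths, the 200MB ‘fat installer’ threshold (borrowed from the installer sizes mentioned earlier) and the 10x expansion guess used to check whether the unpacked tree is likely to fit on the RAM disk.

```python
import shutil
from pathlib import Path

MB = 1024 * 1024
RAM_DISK = Path("R:/work")   # hypothetical RAM disk working directory
HDD_WORK = Path("D:/work")   # hypothetical HDD working directory


def pick_workspace(sample_size: int, ram_free: int,
                   fat_threshold: int = 200 * MB, expansion: int = 10) -> Path:
    """Choose where to unpack a sample, given its size and the free RAM disk space."""
    if sample_size >= fat_threshold:
        return HDD_WORK          # obvious 'fat installer'
    if sample_size * expansion > ram_free:
        return HDD_WORK          # unpacked tree would probably not fit in RAM
    return RAM_DISK


def workspace_for(sample: Path) -> Path:
    """Convenience wrapper that measures the sample and the RAM disk itself."""
    return pick_workspace(sample.stat().st_size, shutil.disk_usage(RAM_DISK).free)
```

The expansion factor is the part most worth tuning against your own sampleset, since some installers compress far better than others.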

So… now you have extracted, distributed and catalogued all this goodness.

What’s next?

For starters, keep all logs so you can troubleshoot issues, and then… this is a series, so there will be another post 🙂