Good file… (What is it good for) Part 2

March 11, 2022 in File Formats ZOO, GoodWare

This series talks about ‘good’ files. That is, files (samples) produced by reputable vendors, often signed, and hopefully not compromised by stolen certificates, vulnerabilities, supply-chain attacks or bothered by other err… minor inconveniences :-).

Say you have amassed a bunch of ‘good’ files and declare your first ‘goodware’ collection as ‘ready for processing’.

What do you do with it?

The easiest way to start processing this sampleset is by applying advanced file typing first. We want to know what files we collected, and mind you, I mean not only a distinction between your random media file (jpeg, gif) and PE/ELF/Mach-O, but also binary vs. scripting, 32- vs. 64-bit, EXE vs. DLL, user mode vs. kernel mode, standalone exe vs. .NET, signed vs. unsigned, standalone vs. installer, installer/packer/protector types: autoit, pyinstaller, perl2exe, legacy PE file protection layers (mpress, pecompact, themida, etc.), MSI, and a gazillion existing installer types that typically store the installation information as compressed/encrypted data appended behind the generic executable installer stub (Nullsoft, InnoSetup, Wise, etc.).
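To give a feel for the very first layer of such typing, here is a minimal Python sketch. It is an illustration only and nowhere near what DiE does: it looks at magic bytes and a handful of PE header fields, and the function name `classify` and the result keys are made up for this example. Note that ‘signed’ here only means a certificate table is present; nothing is verified.

```python
import struct

# Thin (single-architecture) Mach-O magics, big- and little-endian.
MACHO_MAGICS = {
    b"\xfe\xed\xfa\xce": 32, b"\xce\xfa\xed\xfe": 32,
    b"\xfe\xed\xfa\xcf": 64, b"\xcf\xfa\xed\xfe": 64,
}

def classify(data: bytes) -> dict:
    """Coarse file typing from raw bytes; an illustrative sketch only."""
    if data[:4] == b"\x7fELF" and len(data) > 4:
        return {"format": "ELF", "bits": {1: 32, 2: 64}.get(data[4])}
    if data[:4] in MACHO_MAGICS:
        return {"format": "Mach-O", "bits": MACHO_MAGICS[data[:4]]}
    if data[:2] != b"MZ" or len(data) < 0x40:
        return {"format": "unknown"}

    pe = struct.unpack_from("<I", data, 0x3C)[0]        # e_lfanew
    if data[pe:pe + 4] != b"PE\x00\x00":
        return {"format": "MZ without PE header"}
    try:
        characteristics = struct.unpack_from("<H", data, pe + 22)[0]
        opt = pe + 24                                    # optional header
        magic = struct.unpack_from("<H", data, opt)[0]
        if magic not in (0x10B, 0x20B):
            return {"format": "PE", "bits": None}
        subsystem = struct.unpack_from("<H", data, opt + 68)[0]
        dirs = opt + (112 if magic == 0x20B else 96)     # data directories
        count = struct.unpack_from("<I", data, dirs - 4)[0]
        # Each entry is (address, size); missing entries count as empty.
        security = struct.unpack_from("<II", data, dirs + 8 * 4) if count > 4 else (0, 0)
        clr = struct.unpack_from("<II", data, dirs + 8 * 14) if count > 14 else (0, 0)
    except struct.error:
        return {"format": "PE (truncated headers)"}
    return {
        "format": "PE",
        "bits": 64 if magic == 0x20B else 32,
        "dll": bool(characteristics & 0x2000),  # IMAGE_FILE_DLL
        "native": subsystem == 1,               # IMAGE_SUBSYSTEM_NATIVE, typical for drivers
        "dotnet": clr[0] != 0,                  # CLR runtime header directory present
        "signed": security[1] != 0,             # certificate table present (not verified)
    }
```

Installer, packer and protector detection needs signatures and overlay analysis on top of this, which is exactly where a dedicated tool earns its keep.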

To perform this task I use a combination of DiE and my own spaghetti-code script that I have been improving over the last 17 years (sorry, not for sharing, it’s absolutely disgusting!).

Once we know what we are dealing with we can try to unpack stuff.

Why unpack you may ask?

Because what you download from vendors’ sites is often an installer that internally stores many additional files, many of which are OF INTEREST. Yup, additional embedded installers, standalone EXE/DLL/OCX/SYS, redistributables, etc.

How to do it?

7-Zip is a natural candidate, but we need to be careful. I suspect NSRL analysts use it extensively and they unpack everything that has ‘a binary pulse’ recursively until 7z returns an error. As a result, they get tons of executable file sections’ metadata sneaking into their final hash set, and… in the end no one wins.

The other natural candidate is Universal Extractor (and the updated versions of it, e.g. Universal Extractor 2). You can’t win this battle either, because it’s too complex and you lose control of what gets unpacked and what ends up in your final metadata set, same as with an overzealous use of 7z.

We should definitely use these tools though; we just need to apply some moderation.

For instance, we can stop 7-Zip from extracting PE file internals by using its -stx (exclude archive type) command line switch:

7z x -stxPE foo.exe

With that, you could run this recursively on all files (including traversing unpacked directories) and get a decent list of internal files, but without unpacking PE files!
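A recursive run like that could be sketched in Python as below. This is a rough sketch under assumptions: `7z` is on the PATH, the `.unpacked` directory suffix and the `max_depth` limit are choices made up for this example, and the `run` parameter exists only so the 7z call can be swapped out. Exit codes 0 (no error) and 1 (warning) are treated as success; anything else means 7z could not open the file as an archive.

```python
import subprocess
from pathlib import Path

def build_7z_command(archive: Path, out_dir: Path) -> list[str]:
    # -stxPE: do not treat the file as a PE 'archive' (no section dumping)
    # -y: assume Yes on all queries, -o: output directory
    return ["7z", "x", "-stxPE", "-y", f"-o{out_dir}", str(archive)]

def unpack_tree(sample: Path, max_depth: int = 5, run=subprocess.run) -> list[Path]:
    """Unpack sample and everything found inside it; return all extracted files."""
    extracted: list[Path] = []
    queue = [(sample, 0)]
    while queue:
        current, depth = queue.pop()
        if depth >= max_depth:
            continue
        out_dir = current.with_name(current.name + ".unpacked")
        result = run(build_7z_command(current, out_dir), capture_output=True)
        if result.returncode not in (0, 1) or not out_dir.is_dir():
            continue  # not something 7z can open (e.g. a plain PE file)
        for child in out_dir.rglob("*"):
            if child.is_file():
                extracted.append(child)
                queue.append((child, depth + 1))
    return extracted
```

Calling `unpack_tree(Path("foo.exe"))` leaves the extracted tree next to the sample; the depth limit is there so a pathological nested archive cannot keep the loop busy forever.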

For Universal Extractor-based tools, their best use is … analysis of their code. There you will find both the command line syntax and the list of required tools that help to unpack less common archive formats. It’s the best way to study how they handle particular file formats/installers, so that we can then cherry-pick the ones that work for us. Again, we don’t want them all, as you will end up unpacking lots of useless information, e.g. .MSI files, and generating a lot of poor-quality metadata. Be choosy.

For InnoSetup there is InnoUnp.

For Nullsoft, there is 7z version 15.05, which extracts Nullsoft installer files very neatly, including the [NSIS].nsi file that is a decently reproduced Nullsoft Installation Script!

For AutoIt executables you can use ClamAV, as pointed out by @SmugYeti.

Yup, you have to analyze your advanced file typing results first and then … divide and conquer.
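The ‘divide and conquer’ part can be as simple as a lookup table from the detected type to the tool that handles it. The sketch below is hedged heavily: the type labels are hypothetical (use whatever your file typing emits), the tool paths are assumptions about a local setup, and the ClamAV line assumes that running `clamscan` with `--leave-temps` and `--tempdir` leaves the extracted AutoIt script behind in that directory. Check each command against the tool’s own help before trusting it.

```python
from pathlib import Path
from typing import Optional

# Hypothetical type labels mapped to command templates.
# Tool names and locations are assumptions; adjust to your own setup.
UNPACKERS = {
    "Inno Setup": ["innounp", "-x", "-b", "-d{out}", "{src}"],
    "NSIS":       ["7z-15.05/7z", "x", "-y", "-o{out}", "{src}"],
    "AutoIt":     ["clamscan", "--leave-temps", "--tempdir={out}", "{src}"],
}

def unpack_command(file_type: str, src: Path, out: Path) -> Optional[list[str]]:
    """Return the command line for a known type, or None if we have no unpacker."""
    template = UNPACKERS.get(file_type)
    if template is None:
        return None
    return [part.format(src=src, out=out) for part in template]
```

Returning `None` for unknown types keeps the decision explicit: anything without a hand-picked unpacker is simply left alone rather than thrown at a generic extractor.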


Imagine you have file-typed all the samples and you know how to unpack them. What’s next?

I think there are two ways to go about the next step – it starts with a script picking up a single file from your repository, and:

  • recursively unpacking it and its subsequent ‘descendant files’, advanced-file-typing them, and at the end copying files of interest (PE, MSI, etc.) back to the repository
  • unpacking the first layer only, then copying files of interest back to the repository for further processing
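The difference between the two options boils down to a single flag in the processing loop. Below is a minimal Python sketch of that idea; `unpack` and `file_type` are hypothetical callables standing in for your real unpacker and file typing, and `INTERESTING` is an assumed set of type labels.

```python
from typing import Callable, Iterable

INTERESTING = {"PE", "MSI"}  # assumed labels for 'files of interest'

def process(sample: str,
            unpack: Callable[[str], Iterable[str]],
            file_type: Callable[[str], str],
            recursive: bool) -> list[str]:
    """Return descendant files worth copying back to the repository."""
    keep: list[str] = []
    pending = list(unpack(sample))
    while pending:
        child = pending.pop()
        if file_type(child) in INTERESTING:
            keep.append(child)
        if recursive:
            # Option 1: keep digging into every descendant right now.
            pending.extend(unpack(child))
        # Option 2 (recursive=False): stop here; files copied back to the
        # repository get picked up by a later run of the same script.
    return keep
```

With `recursive=False` the same file may pass through the script several times as each layer is peeled off, which is the price paid for a smaller working set per run.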

I think both approaches have advantages, with the first one being probably the ‘smartest’ (i.e. do it once, well), and the second more optimized for resource usage. Why? With source files often being 200MB in size or more (yes, there are plenty of such installers nowadays!), the whole ‘extracted files & directories’ tree may end up being several, or even a few dozen, GB of data! And that simply implies the necessity of using the slow HDD as ‘a working space’.

A side note #1 here: I am talking about small SOHO hardware investments! Note #2: I also don’t advocate using SSDs in your SOHO ‘sample processing’ setup, as they tend to fail after too many I/O operations. I had too many issues in the past and don’t recommend them.

In the second approach, a RAM disk may be good enough most of the time, and it’s definitely better performance-wise.

Choose your poison wisely.

I actually use a hybrid approach – if the file is relatively small I try to unpack it fully on the RAM drive, and if it is an obvious ‘fat installer’ I push it to HDD. And I always try to do ‘a full-recursive’. Saves time and is kinda neat. Again, your personal choice matters here.
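The hybrid choice itself is a one-liner. A small Python sketch follows, with everything in it being an assumption: the two work directories are hypothetical paths, and the 200MB cut-off simply reuses the ‘fat installer’ size mentioned earlier rather than any measured value.

```python
from pathlib import Path

RAM_DISK = Path("R:/work")   # hypothetical RAM drive location
HDD_WORK = Path("D:/work")   # hypothetical HDD working space
FAT_INSTALLER_BYTES = 200 * 1024 * 1024  # assumed threshold

def pick_workspace(size_bytes: int, threshold: int = FAT_INSTALLER_BYTES) -> Path:
    """Small files unpack on the RAM drive, obvious fat installers on the HDD."""
    return RAM_DISK if size_bytes < threshold else HDD_WORK

def workspace_for(sample: Path) -> Path:
    # Decide from the size of the file on disk.
    return pick_workspace(sample.stat().st_size)
```

File size is only a proxy for how big the extracted tree will get, so a sensible threshold depends on how much RAM the drive has to spare.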

So… now you have extracted, distributed and catalogued all this goodness.

What’s next?

For starters, keep all logs so you can troubleshoot issues, and then… this is a series, so there will be another post 🙂
