Good file… (What is it good for) Part 2

This series talks about ‘good’ files. That is, files (samples) produced by reputable vendors, often signed, and hopefully not compromised by stolen certificates, vulnerabilities, or supply-chain attacks, or bothered by other, err… minor inconveniences :-).

Say you have amassed a bunch of ‘good’ files and declared your first ‘goodware’ collection ‘ready for processing’.

What do you do with it?

The easiest way to start processing this sampleset is by applying advanced file typing first. We want to know what files we collected, and mind you, I mean not only a distinction between your random media file (jpeg, gif) and PE/ELF/Mach-O, but also:

  • binary vs. scripting
  • 32- vs. 64-bit
  • EXE vs. DLL
  • user mode vs. kernel mode
  • standalone EXE vs. .NET
  • signed vs. unsigned
  • standalone vs. installer
  • installer/packer/protector types: AutoIt, PyInstaller, Perl2Exe, legacy PE file protection layers (MPRESS, PECompact, Themida, etc.), MSI, and the gazillion existing installer types that typically store the installation information as compressed/encrypted data appended behind a generic executable installer stub (Nullsoft, InnoSetup, Wise, etc.)

To perform this task I use a combination of DiE and my own spaghetti-code script that I have been improving over the last 17 years (sorry, not for sharing, it’s absolutely disgusting!).
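Still, for a taste of what such typing logic boils down to, here’s a minimal sketch using the pefile Python library – an illustration of a few of the distinctions listed above, emphatically not the spaghetti script itself:

import pefile

def quick_pe_triage(path):
    # parse headers only; data directory entries live in the optional header
    pe = pefile.PE(path, fast_load=True)
    info = {}
    # 32- vs. 64-bit (AMD64 / ARM64 machine types)
    info['bits'] = 64 if pe.FILE_HEADER.Machine in (0x8664, 0xAA64) else 32
    # EXE vs. DLL (IMAGE_FILE_DLL characteristic)
    info['is_dll'] = bool(pe.FILE_HEADER.Characteristics & 0x2000)
    # user mode vs. kernel mode (native subsystem is a strong driver hint)
    info['is_native'] = pe.OPTIONAL_HEADER.Subsystem == 1
    # standalone vs. .NET (COM descriptor data directory, index 14)
    clr = pe.OPTIONAL_HEADER.DATA_DIRECTORY[14]
    info['is_dotnet'] = clr.VirtualAddress != 0 and clr.Size != 0
    # signed vs. unsigned (embedded Authenticode blob, data directory 4;
    # note: catalog-signed files will still show up as 'unsigned' here)
    sec = pe.OPTIONAL_HEADER.DATA_DIRECTORY[4]
    info['is_signed'] = sec.VirtualAddress != 0 and sec.Size != 0
    return info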

Once we know what we are dealing with we can try to unpack stuff.

Why unpack you may ask?

Because what you download from vendors’ sites is often an installer that internally stores many additional files, many of which are OF INTEREST. Yup, additional embedded installers, standalone EXE/DLL/OCX/SYS files, redistributables, etc.

How to do it?

7-zip is a natural candidate, but we need to be careful. I suspect NSRL analysts use it extensively and that they unpack everything that has ‘a binary pulse’ recursively until 7z returns an error. As a result, they get tons of executable file sections’ metadata sneaking into their final hash set, and… in the end no one wins.

The other natural candidate is Universal Extractor (and its updated versions, e.g. Universal Extractor 2). You can’t win this battle either, because it’s too complex and you lose control over what gets unpacked and what ends up in your final metadata set, same as with an overzealous use of 7z.

We should definitely use these tools though, we just need to apply some moderation.

For instance, we can disable extraction of PE file internals by 7-zip by using its -stx (exclude archive type) command line switch:

7z x -stxPE foo.exe

With that, you can run it recursively on all files (including traversing the unpacked directories) and get a decent list of internal files, but without unpacking the PE files themselves!
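A rough sketch of that recursive pass in Python – it shells out to whatever 7z binary is on the PATH, and the ‘.unpacked’ output-directory naming is my own convention:

import subprocess
from pathlib import Path

def unpack_recursively(root: Path):
    # walk everything under root; extract each file with 7z (PE dissection
    # disabled via -stxPE) and queue whatever the extraction produced
    seen = set()
    queue = [p for p in root.rglob('*') if p.is_file()]
    while queue:
        f = queue.pop()
        if f in seen:
            continue
        seen.add(f)
        out_dir = f.with_name(f.name + '.unpacked')
        r = subprocess.run(['7z', 'x', '-y', '-stxPE', f'-o{out_dir}', str(f)],
                           capture_output=True)
        if r.returncode == 0 and out_dir.exists():
            queue.extend(p for p in out_dir.rglob('*') if p.is_file())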

For Universal Extractor-based tools, their best use is… analysis of their code. You will find info on both the syntax and the required toolkit that helps to unpack less common archive formats. It’s the best way to study their handling of particular file formats/installers and how to unpack them, so that we can then cherry-pick those that work for us. Again, we don’t want all of it, as you will end up unpacking lots of useless information (e.g. from .MSI files) and generating a lot of poor-quality metadata. Be choosy.

For InnoSetup there is InnoUnp.

For Nullsoft, there is 7-zip version 15.05 that extracts Nullsoft installer files very neatly, including the [NSIS].nsi file that is a decently reproduced Nullsoft Installation Script!
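Something along these lines, assuming you keep that old binary side by side with the current one (the 7z1505.exe name is my own convention):

7z1505.exe x installer.exe -oinstaller_unpacked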

For AutoIt executables you can use ClamAV, as pointed out by @SmugYeti.
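The trick, roughly, is to let clamscan keep its temporary files, which will include the extracted AutoIt script (flags per the clamscan man page; the output directory name is my choice):

clamscan --debug --leave-temps=yes --tempdir=autoit_out sample.exe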

Yup, you have to analyze your advanced file typing results first and then … divide and conquer.

So…

Imagine you have file-typed all the samples and you know how to unpack them – what’s next?

I think there are two ways to go about the next step – it starts with a script picking up a single file from your repository, and:

  • recursively unpacking it and its subsequent ‘descendant files’, advanced file-typing them, and at the end copying the files of interest (PE, MSI, etc.) back to the repository
  • unpacking the first layer only, then copying the files of interest back to the repository for further processing

I think both approaches have advantages, with the first one probably being the ‘smartest’ (i.e. do it once, well), and the second more optimized for resource usage. Why? With source files often being 200MB in size or more (yes, there are plenty of such installers nowadays!), the whole ‘extracted files & directories’ tree may end up being a good few, even a few dozen, GB of data! And that simply implies the necessity of using the slow HDD as ‘a working space’.

A side note #1 here: I am talking about small SOHO hardware investments! Note #2: I also don’t advocate using SSDs in your SOHO ‘sample processing’ setup as they tend to fail after too many I/O operations. I’ve had too many issues in the past and don’t recommend them.

In the second approach, a RAM disk may be good enough most of the time, and it’s definitely better, performance-wise.

Choose your poison wisely.

I actually use a hybrid approach – if the file is relatively small I try to unpack it fully on the RAM drive, and if it is an obvious ‘fat installer’ I push it to the HDD. And I always try to do ‘a full-recursive’. It saves time and is kinda neat. Again, your personal choice matters here.
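In code, that hybrid dispatch is trivial – the mount points and the 200MB cut-off below are assumptions, so tune them to your own hardware:

from pathlib import Path

RAM_SCRATCH = Path('/mnt/ramdisk/work')   # assumed RAM disk mount point
HDD_SCRATCH = Path('/data/scratch')       # assumed HDD working space
FAT_INSTALLER = 200 * 1024 * 1024         # the ~200MB mark mentioned above

def pick_workspace(sample: Path) -> Path:
    # small files go to the RAM drive, obvious 'fat installers' to the HDD
    base = HDD_SCRATCH if sample.stat().st_size >= FAT_INSTALLER else RAM_SCRATCH
    work = base / sample.name
    work.mkdir(parents=True, exist_ok=True)
    return work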

So… now you have extracted, distributed and catalogued all this goodness.

What’s next?

For starters, keep all logs so you can troubleshoot issues, and then… this is a series, so there will be another post 🙂

Good file… (What is it good for) Part 1

Most (anti-)malware researchers focus on malware samples, because… it’s only natural in this line of work. For a while now I have been trying to focus on the opposite – the good, ‘clean’ files (primarily in the PE file format). While it may sound boring & mundane, maybe even somewhat trivial, this is actually a very difficult task!

Why? Hear me out!

There are no samplesets available out there (at least none that I know of), and any sampleset expires really fast (who cares about drivers for XP or Vista, or even any 32-bit files anymore). The ‘availability’ bit is tricky too (apart from some driver distros originating mostly from Russia, it’s hard to download anything ‘in bulk’), and yes… in the end you are pretty much on your own when you want to collect some new ‘good’ samples…

And ‘good’ companies generate a lot of these… and many of them are not even interesting to us, and… you may ask yourself… what are all these good files really good for?

From the offensive perspective it’s easy: find good files, see if they are vulnerable, find those that are, write POC exploits & either submit CVEs or sell 0days to exploit brokers. Oh wait, it’s not really ‘good’, is it? Let’s leave that one sitting on the fence for the time being.

What about ‘le’ defense?

Basic analysis of any clean Windows sampleset from the last 20 years can tell us that the most common number of PE sections inside these ‘good’ executable files is 5 (31%), followed by 4 and 6 (both 13%), then 3 (12%), 2 (10%), and 1 (5%).

Note: like with all statistics, these percentages are not to be trusted, because they come from a relatively small set of clean files, many of which are from the PAST (2000-2020). Still, it’s something we can at least initiate a conversation with, right? And I doubt the percentages will vary much in larger samplesets, because good files are what they are – a product of compilers, and compilers tend to follow a template…
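If you want to sanity-check these numbers against your own corpus, a quick pefile-based histogram will do (a sketch, not the tooling behind the numbers above):

from collections import Counter
from pathlib import Path
import pefile

def section_histogram(root: Path) -> Counter:
    # count NumberOfSections across every parsable PE file under root
    counts = Counter()
    for p in root.rglob('*'):
        if not p.is_file():
            continue
        try:
            pe = pefile.PE(str(p), fast_load=True)
        except pefile.PEFormatError:
            continue  # not a PE file, skip
        counts[pe.FILE_HEADER.NumberOfSections] += 1
    return counts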

We can exploit that. And we should.

My hypothesis is that no matter what cluster of samples we look at, most good PE files will oscillate around that 5-section mark by default. Oh… wait… newer compilers may actually shift that number a bit higher – this is because of the inclusion of additional sections that we now see added ‘by default’, e.g. the ‘.pdata’ and ‘.didat’ sections. And to bring up a good example here – Windows 10’s Calculator (stub) has 5 sections, and Notepad has 7.

So… 5..7 range it is.

Anything outside of it is probably… mildly interesting. Why ‘mildly’? These numbers are good for Microsoft compilers/PE files, but files built with non-Microsoft compilers will have to fall into a different bucket. Compiler detection is critical here, and only once we have it can we correlate the average number of PE sections in ‘good’ files generated by that specific compiler. Think Delphi/Embarcadero, mingw, Go, Nim, Zig, Rust, PyInstaller, etc. Non-trivial 🙁
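One way to organize this is a per-compiler baseline table. Only the MSVC 5..7 range comes from the discussion above; the other entries are placeholders you would fill in from your own per-compiler statistics:

# expected section-count ranges per compiler; None = not measured yet
BASELINES = {
    'msvc': (5, 7),       # the 5..7 range discussed above
    'delphi': None,       # placeholder, to be measured on your own corpus
    'go': None,           # placeholder
    'pyinstaller': None,  # placeholder
}

def section_count_is_odd(compiler: str, n_sections: int) -> bool:
    rng = BASELINES.get(compiler)
    if rng is None:
        return False  # no baseline yet, withhold judgement
    lo, hi = rng
    return not (lo <= n_sections <= hi)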

PE sections are for beginners tho. There’s really no point spending much time on them, because the PE profiling landscape has changed a lot over the last 15 years. Luckily, there are many other properties of ‘good’ files that are worth discussing.

First, the PDB paths. Same as with malware, we can collect a large corpus of these and create a cluster of ‘reverse-logic’ yara rules. That is, if the yara rule hits and the file contains one of the legitimate-looking PDB file names/paths, it is most likely good! It is of course a terribly naive assumption – believing that all files with legitimate PDB paths are good – but… why not, for starters.

For instance, if the file is not detected by any AV and contains unique PDB strings that look like one of these clean PDB paths (on a curated list), then it’s highly possible it is, indeed, a clean one! Right?

And together with other characteristics of good PE files, we may craft slightly more complex yara rules that could all be good indicators of file goodness w/o losing flexibility (e.g. they keep working across multiple versions of the same file).

I don’t want to burn ‘good’ yara rules in public, because this kills the whole idea described above, so I won’t be posting too many examples (more about it later as well), but think of any typical, legitimate-looking PDB path.

If it was found in malware, you would write a yara signature for it, right?

You can do the same for ‘good’ files.
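A sketch of such a reverse-logic rule, compiled here via the yara-python module; the PDB path inside is a made-up placeholder, not an entry from any curated list:

import yara

RULE = r'''
rule likely_good_pdb
{
    strings:
        // hypothetical 'clean' PDB path, for illustration only
        $pdb = "D:\\build\\Release\\goodtool.pdb" nocase
    condition:
        $pdb
}
'''

rules = yara.compile(source=RULE)
matches = rules.match('sample.exe')   # a hit = candidate 'good' file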

The second idea is focused on GUIDs.

Many clean files come as COM libraries, and the GUIDs referenced by their type libraries are unique. One could create another set of ‘reverse-logic’ yara rules for these, and this way discover ‘good’, clean files that reference them. It could be an occurrence of the GUID in string format, either ANSI or Unicode, as well as in its binary representation.

Again, the assumption here is that bad guys don’t use the same GUIDs in their poly-/meta-morphic generators (yet). See it for yourself – a GUID like this typically has only a few Google hits, which makes it a good indicator of ‘goodness’.
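For completeness, here are the three forms mentioned above, generated with Python’s uuid module; the GUID used is a random placeholder:

import uuid

guid = uuid.UUID('01234567-89ab-cdef-0123-456789abcdef')  # placeholder GUID
ansi = str(guid).encode('ascii')       # ANSI string form
wide = str(guid).encode('utf-16-le')   # Unicode (UTF-16LE) string form
binary = guid.bytes_le                 # raw 16-byte Windows GUID layout

def references_guid(data: bytes) -> bool:
    # a file 'references' the GUID if any of the three forms occurs in it
    return any(form in data for form in (ansi, wide, binary))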

Next, we can look at resources. Many legitimate executable files embed ‘branded’ icons. We know these are already being leveraged by some bad guys, but having a large set of these ‘good’ icons, extracted from many clean samples, can help push the samples that include them into different pipelines.
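Harvesting those icons at scale is straightforward with pefile – a sketch that hashes every RT_ICON resource so that identical icons cluster across samples:

import hashlib
import pefile

def icon_hashes(path):
    # collect sha256 hashes of all RT_ICON resource entries in a PE file
    pe = pefile.PE(path)
    hashes = set()
    if not hasattr(pe, 'DIRECTORY_ENTRY_RESOURCE'):
        return hashes  # no resources at all
    for rtype in pe.DIRECTORY_ENTRY_RESOURCE.entries:
        if rtype.id != pefile.RESOURCE_TYPE['RT_ICON']:
            continue
        for name in rtype.directory.entries:
            for lang in name.directory.entries:
                data = pe.get_data(lang.data.struct.OffsetToData,
                                   lang.data.struct.Size)
                hashes.add(hashlib.sha256(data).hexdigest())
    return hashes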

And such icon matches, especially when combined with characteristics of import/export tables, their hashes, or other basic file properties like size, number of sections and their names, or even a matching subset of strings, plus version information, number of localized strings, entropy, signatures, etc., can form unique descriptors of ‘goodness’.
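A descriptor along those lines might look like this; the field choice here is illustrative, not my actual schema:

import pefile
from pathlib import Path

def goodness_descriptor(path: Path) -> dict:
    # a few cheap, automatable properties bundled into one record
    pe = pefile.PE(str(path))
    return {
        'size': path.stat().st_size,
        'imphash': pe.get_imphash(),   # import table hash
        'num_sections': pe.FILE_HEADER.NumberOfSections,
        'section_names': [s.Name.rstrip(b'\x00').decode(errors='replace')
                          for s in pe.sections],
    }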

Can these be abused? 100%. This is why I am not making the results of this research public.

And it is for sure that ‘good’ files follow patterns – they don’t change that much, and we can exploit that. And since these properties can be extracted automatically, this is in fact a great place for machine learning (unlike actual malware)! What if what we need is an ML/AI algo that learns from ‘good’ files? Yes, it’s not a new concept, but how much of this kind of research is actually made public (especially the algorithmic part)? With this series I plan to bring some of my personal research to the public eye with a hope it can inspire more work in this space.
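As a taste of the idea, a one-class setup fits naturally: train only on goodware feature vectors and treat outliers as ‘worth a look’. Below, sklearn’s IsolationForest stands in for the algorithm, and the feature matrix is assumed to be precomputed:

import numpy as np
from sklearn.ensemble import IsolationForest

# one row per good file: [num_sections, size, entropy, ...] (assumed file)
good = np.load('good_features.npy')

model = IsolationForest(random_state=0).fit(good)

def looks_unusual(features: np.ndarray) -> bool:
    # IsolationForest.predict: 1 = inlier (goodware-like), -1 = outlier
    return model.predict(features.reshape(1, -1))[0] == -1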

And coming back to what I mentioned earlier – I do face a dilemma. I have collected many of these artifacts and statistics during my spelunking, but I don’t think it’s a good idea to share them publicly. I think there actually is scope for a DST debate, same as there is for OST, and at this moment in time I believe some artifacts or their collections – ‘defensive’ findings if you will – should be shared within trusted circles only!!!

Yes, it’s a 180-degree change of my stance compared to, say, 10 years ago, but we live in strange times. If I publish a list of all clean PDB paths, clean GUIDs, and clusters of legitimate icons, it’s a given that the next generation of malware will immediately re-use them in their creations! And some Red Teams may use them too.

So, it’s a No.

I am also worried about unfair players in the corporate space who would simply acquire this data for free and use it in their commercial offerings, on both the defensive and offensive side.

So, this is a No, too.

Yup, good sharing times are over, sorry.

Where does it leave us?

I guess there is not much that can be done here…