Analysing NSRL data set for fun and because… curious, Part 2

This is the second post discussing what we can find inside the NSRL data set.

At this stage we know it’s not only file hashes, but also sections of executables and Java .class files stored inside JAR files. Digging a bit deeper into the file name statistics, we find another quite substantial subset of hashes: MSI tables. They happen to be named in a very specific way, i.e. with an exclamation mark as a prefix. There are about 350K entries like this; the most common ones are:

35666 "!_StringData"
26126 "!_StringPool"
14844 "!File"
11991 "!Property"
11530 "!Error"
9987 "!Media"
9545 "!MsiFileHash"
9063 "!Feature"
9026 "!InstallExecuteSequence"
8949 "!Component"
8844 "!Registry"
8822 "!CustomAction"
8781 "!Directory"
8702 "!FeatureComponents"
7535 "!UIText"
7080 "!_Columns"
7009 "!_Tables"
6609 "!Binary"
5837 "!Control"
5540 "!AdvtExecuteSequence"
5528 "!Upgrade"
5500 "!_Validation"
5236 "!RegLocator"
5174 "!ActionText"
5085 "!InstallUISequence"
4843 "!AdminExecuteSequence"
4704 "!CreateFolder"
4638 "!Dialog"
4606 "!RadioButton"
4087 "!AppSearch"
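
A tally like the one above can be produced with a few lines of Python. This is only a minimal sketch: it assumes the file names have already been extracted from the data set into an iterable of strings, and the sample names below are made up for illustration.

```python
from collections import Counter


def count_msi_table_names(file_names):
    """Count file names that start with '!' (the MSI-table naming pattern)."""
    return Counter(name for name in file_names if name.startswith("!"))


# Made-up sample names, for illustration only (not real NSRL counts).
sample = ["!_StringData", "!File", "setup.exe", "!_StringData", "Main.class"]

counts = count_msi_table_names(sample)
for name, n in counts.most_common():
    # Same layout as the listing above: count first, then the quoted name.
    print(n, f'"{name}"')
```

On the real set, the same function would be fed the file name column of every record instead of the sample list.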

That may not look like a lot, but if we exclude all file names that start with an exclamation mark or a dot (a bit unfair, but a good estimate for section names), as well as .class files, the number of entries drops by nearly 30%:

192,677,750 - all
 57,390,395 - !<filename>, .<filename>, <filename>.class
135,287,355 - excluding !<filename>, .<filename>, <filename>.class
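
The filter behind these numbers is simple enough to sketch. The predicate below is my reading of the three rules (leading exclamation mark, leading dot, .class extension); the function name is hypothetical, and the arithmetic at the end only re-checks the totals quoted above.

```python
def is_embedded_like(name):
    """True for '!<filename>', '.<filename>' and '<filename>.class' names."""
    return name.startswith(("!", ".")) or name.endswith(".class")


# Totals quoted in the post.
all_entries = 192_677_750
excluded = 57_390_395

remaining = all_entries - excluded
excluded_pct = 100 * excluded / all_entries

print(remaining)
print(round(excluded_pct, 1))
```

Note that the leading-dot rule also catches ordinary dotfiles such as .gitignore, which is part of why the estimate is a bit unfair.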

We could narrow it down further by excluding file names that start with bracketed numbers (e.g. [5]SummaryInformation) or underscores (e.g. __DATA__la_symbol_ptr). There is also a substantial number of file names that are just numbers, numbers with media file extensions (e.g. 1494.bmp), or are in one way or another related to executable resources (manifest.txt, version.txt, CERTIFICATE, VERSION, etc.).
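
These extra buckets map nicely onto a handful of regular expressions. The sketch below is an assumption-heavy illustration: the category labels are mine, the list of media extensions is a guess (only .bmp is taken from the example above), and the resource names are just the four mentioned in the text.

```python
import re

# Hypothetical category labels and patterns; tune them against the real data.
PATTERNS = [
    ("bracketed_number", re.compile(r"^\[\d+\]")),  # e.g. [5]SummaryInformation
    ("underscore_prefix", re.compile(r"^_")),       # e.g. __DATA__la_symbol_ptr
    # Plain numbers, optionally followed by a media extension (list is a guess).
    ("numeric_name", re.compile(r"^\d+(\.(bmp|ico|png|gif|jpg|wav|avi))?$", re.IGNORECASE)),
]

# Names related to executable resources, as listed in the text.
RESOURCE_NAMES = {"manifest.txt", "version.txt", "CERTIFICATE", "VERSION"}


def categorize(name):
    """Return the first matching category label, or 'other' if nothing matches."""
    if name in RESOURCE_NAMES:
        return "resource_related"
    for label, pattern in PATTERNS:
        if pattern.search(name):
            return label
    return "other"


for example in ["[5]SummaryInformation", "__DATA__la_symbol_ptr", "1494.bmp", "VERSION", "setup.exe"]:
    print(example, "->", categorize(example))
```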

Another area of interest is files that will most likely always be uniquely bound to the NSRL test systems where they were generated and will never appear on other systems, e.g. <filename>.pyc, <filename>.pyo, and .gitignore.

Kudos to whoever is responsible for maintaining the NSRL set. It is an incredibly difficult task to build a list of good hashes. It’s tempting to unpack, decompile, and debundle everything, and sometimes this activity may just generate a little bit too much noise. Unless I missed something, I hope future versions of the set will include a flag for each entry to indicate whether the file is an embedded resource or a regular file system object. Another useful field would be a parent, so that one could rebuild a tree of parent-child relationships leading back to the original ancestor file. For example, if we start with a .msi file that includes a .zip file that includes a bunch of PE files, we could trace a PE file back to the .zip file and then to the .msi, and vice versa.
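
To make the wish a bit more concrete, here is a rough sketch of what such records could look like and how the ancestry could be walked. Everything in it is hypothetical: the `embedded` and `parent` fields do not exist in the data set as far as I can tell, the identifiers stand in for real hashes, and a single parent per entry is a simplification (the same hash could easily live inside many containers).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    sha1: str                     # placeholder identifier, not a real hash
    name: str
    embedded: bool                # hypothetical flag: embedded resource or not
    parent: Optional[str] = None  # hypothetical field: sha1 of the container


def ancestry(entries: Dict[str, Entry], sha1: str) -> List[str]:
    """Walk parent links from an entry up to its top-level ancestor."""
    chain = []
    seen = set()
    current = sha1
    while current is not None and current not in seen:
        seen.add(current)         # guard against accidental cycles
        entry = entries[current]
        chain.append(entry.name)
        current = entry.parent
    return chain


# Made-up example: a PE file inside a .zip inside an .msi.
entries = {
    "aa": Entry("aa", "setup.msi", embedded=False),
    "bb": Entry("bb", "bundle.zip", embedded=True, parent="aa"),
    "cc": Entry("cc", "tool.exe", embedded=True, parent="bb"),
}

print(" <- ".join(ancestry(entries, "cc")))
```

Going the other way (from the .msi down to its children) would just need an index from parent to entries built from the same field.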

But hey… what does ‘file’ even mean today? Is a .class inside a JAR file a file system object? Or a resource hidden from the file system by the packager abstraction layer?