Analysing NSRL data set for fun and because… curious, Part 2

This is the second post discussing what we can find inside the NSRL data set.

At this stage we know it’s not only file hashes, but also sections of executables and Java .class files stored inside JAR files. Digging a bit more into the file name statistics, we find another quite substantial subset of hashes: MSI tables. They are named in a very specific way, i.e. with an exclamation mark as a prefix. There are 350K entries like this:

35666 "!_StringData"
26126 "!_StringPool"
14844 "!File"
11991 "!Property"
11530 "!Error"
9987 "!Media"
9545 "!MsiFileHash"
9063 "!Feature"
9026 "!InstallExecuteSequence"
8949 "!Component"
8844 "!Registry"
8822 "!CustomAction"
8781 "!Directory"
8702 "!FeatureComponents"
7535 "!UIText"
7080 "!_Columns"
7009 "!_Tables"
6609 "!Binary"
5837 "!Control"
5540 "!AdvtExecuteSequence"
5528 "!Upgrade"
5500 "!_Validation"
5236 "!RegLocator"
5174 "!ActionText"
5085 "!InstallUISequence"
4843 "!AdminExecuteSequence"
4704 "!CreateFolder"
4638 "!Dialog"
4606 "!RadioButton"
4087 "!AppSearch"

While that may not look like a lot, if we exclude all file names that start with an exclamation mark or a dot (a bit unfair, but a good estimate for section names), plus .class files, the number of entries drops by nearly 30%:

192,677,750 - all
 57,390,395 - !<filename>, .<filename>, <filename>.class
135,287,355 - excluding !<filename>, .<filename>, <filename>.class
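
The split above boils down to three string checks. Here is a minimal sketch, assuming the file names have already been pulled out of the data set; is_embedded_name and embedded_share are hypothetical helper names of mine, and matching .class case-insensitively is my assumption, not necessarily how the numbers above were produced:

```python
def is_embedded_name(name: str) -> bool:
    """Heuristic: True for names that look like MSI tables, sections or .class entries."""
    return (
        name.startswith("!")                 # MSI tables, e.g. !_StringData
        or name.startswith(".")              # section names such as .text (also hits dotfiles)
        or name.lower().endswith(".class")   # compiled Java classes (case-insensitive: assumption)
    )


def embedded_share(names) -> float:
    """Fraction of names flagged by is_embedded_name (0.0 for an empty input)."""
    total = flagged = 0
    for name in names:
        total += 1
        flagged += is_embedded_name(name)
    return flagged / total if total else 0.0


# Sanity check of the totals quoted above.
ALL_ENTRIES = 192_677_750
FLAGGED = 57_390_395
print(ALL_ENTRIES - FLAGGED)                    # 135287355
print(round(100 * FLAGGED / ALL_ENTRIES, 1))    # 29.8
```

The dot rule is the unfair one: it flags .text, but it also flags a perfectly regular file like .gitignore.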

We could further narrow it down by excluding filenames starting with bracketed numbers f.ex. [5]SummaryInformation or underscores f.ex. __DATA__la_symbol_ptr. There is also a substantial number of filenames that are just numbers, numbers with media file extensions (f.ex. 1494.bmp) or are one way or another related to executable resources (manifest.txt, version.txt, CERTIFICATE, VERSION, etc.).
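
These extra buckets can be roughed out with a few regular expressions. This is only a sketch: the patterns, the list of media extensions and the noise_category name are my guesses based on the examples above, not rules taken from NSRL:

```python
import re

# Illustrative patterns only; each one mirrors an example mentioned in the text.
NOISE_PATTERNS = [
    ("bracketed_number", re.compile(r"^\[\d+\]")),   # [5]SummaryInformation
    ("underscore_prefix", re.compile(r"^_")),        # __DATA__la_symbol_ptr
    # Bare numbers, optionally with a media extension (extension list is an assumption).
    ("numeric_name",
     re.compile(r"^\d+(\.(bmp|png|gif|jpe?g|ico|wav))?$", re.IGNORECASE)),
]

# Names related to executable resources, as listed in the text.
RESOURCE_NAMES = {"manifest.txt", "version.txt", "CERTIFICATE", "VERSION"}


def noise_category(name: str):
    """Return a label for names that look like embedded noise, or None otherwise."""
    if name in RESOURCE_NAMES:
        return "resource_name"
    for label, pattern in NOISE_PATTERNS:
        if pattern.search(name):
            return label
    return None


for sample in ["[5]SummaryInformation", "__DATA__la_symbol_ptr", "1494.bmp", "kernel32.dll"]:
    print(sample, "->", noise_category(sample))
```

The underscore rule is blunt: it would also flag a regular file such as Python's __init__.py, so it needs an allow list before being used for real.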

Another area of interest is files that will most likely always be uniquely bound to the NSRL test systems where they were generated and will never appear on other systems, f.ex. files matching the following patterns: <filename>.pyc, <filename>.pyo, .gitignore.

Kudos to whoever is responsible for maintaining the NSRL set. It is an incredibly difficult task to build a list of good hashes. It’s tempting to unpack, decompile, debundle everything, and sometimes this activity may just generate a little bit too much noise. Unless I missed something, I hope future versions of the set will include a flag for each entry to indicate whether the file is an embedded resource or a regular file system object. Another useful field would be a parent, so that one could rebuild a tree of parent-child relationships leading back to the original ancestor file. F.ex. if we start with a .msi file that includes a .zip file that includes a bunch of PE files, we could trace it from a PE file back to the .zip file and then to the .msi, and vice versa.
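
To make the wish a bit more concrete, here is a sketch of what such records could look like. Everything in it is hypothetical: NSRL has no parent or embedded fields as far as I can tell, and the ids and file names are made up for illustration:

```python
from typing import Dict, List, NamedTuple, Optional


class Entry(NamedTuple):
    name: str
    parent: Optional[str]   # hypothetical: id of the containing object, None for a top-level file
    embedded: bool          # hypothetical: True if the entry only exists inside another file


def lineage(entries: Dict[str, Entry], entry_id: str) -> List[str]:
    """Walk from an entry up to its original ancestor, returning the names on the way."""
    chain, seen = [], set()
    current: Optional[str] = entry_id
    while current is not None:
        if current in seen:
            raise ValueError("cycle in parent links")
        seen.add(current)
        entry = entries[current]
        chain.append(entry.name)
        current = entry.parent
    return chain


def children(entries: Dict[str, Entry], parent_id: str) -> List[str]:
    """The reverse direction: names of everything directly inside parent_id."""
    return sorted(e.name for e in entries.values() if e.parent == parent_id)


# Made-up example: an .msi that contains a .zip that contains two PE files.
ENTRIES = {
    "msi1": Entry("setup.msi", None, False),
    "zip1": Entry("payload.zip", "msi1", True),
    "pe1": Entry("tool.exe", "zip1", True),
    "pe2": Entry("helper.dll", "zip1", True),
}

print(" <- ".join(lineage(ENTRIES, "pe1")))   # tool.exe <- payload.zip <- setup.msi
print(children(ENTRIES, "zip1"))              # ['helper.dll', 'tool.exe']
```

With a flag like embedded in place, building a file-system-only hash set would be a one-line filter.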

But hey… what does ‘file’ even mean today? Is a .class inside .JAR file a file system object? Or a resource hidden from the file system by the packager abstraction layer?

Analysing NSRL data set for fun and because… curious

Last year I took a very quick look at the NSRL hash set. Since it is the de facto gold standard of good hashes, I was curious what sort of data is actually included in it. There is no better way of looking at it than actually looking at it, so I downloaded the files and started some basic analysis.

NSRLFile.txt.zip stores a 26G file, NSRLFile.txt, that includes the following number of entries:

192677750 NSRLFile.txt
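
If you don't fancy unpacking 26G to disk just to count lines, the archive can be streamed. A minimal sketch using Python's standard zipfile module; count_lines_in_zip is a name I made up, and I am assuming the member inside the archive is called NSRLFile.txt:

```python
import zipfile


def count_lines_in_zip(zip_source, member="NSRLFile.txt"):
    """Count the lines of one archive member by streaming it, without extracting it."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as stream:
            return sum(1 for _ in stream)


# Usage (path is an assumption about where you saved the download):
# print(count_lines_in_zip("NSRLFile.txt.zip"))
```

Note that this counts every line, so a header row, if present, is included in the total.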

Yay, that’s a lot!

Now that we know how many entries we have in it, let’s try to see what sort of file extensions we can find inside the file. After parsing the data set I came up with the following results (top 30):

class 44776614
<no ext> 36040752
png 8968329
manifest 6258021
js 4846976
foliage 4513590
py 3938119
dll 3658422
java 3263630
nasl 3256040
html 3068834
h 2671295
xml 2582138
cat 2355352
htm 2197584
mum 2164820
txt 1906815
uasset 1859095
dat 1779063
o 1749951
c 1697767
gz 1691747
mui 1377019
svg 1018434
properties 1004380
mo 1000680
gif 973653
ogg 842367
dds 822591
upk 772181
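
A histogram like this takes only a few lines to produce. The sketch below is one way to do it, not necessarily the way the numbers above were generated: it relies on os.path.splitext, which treats names with only a leading dot (such as .text) as having no extension, and it lowercases extensions, which is my assumption:

```python
import os
from collections import Counter


def extension_of(name: str) -> str:
    """Lowercased extension without the dot, or '<no ext>' if there is none."""
    ext = os.path.splitext(name)[1][1:].lower()
    return ext or "<no ext>"


def extension_histogram(names, top=30):
    """Return the `top` most common extensions as (extension, count) pairs."""
    return Counter(extension_of(n) for n in names).most_common(top)


# Tiny made-up input, just to show the shape of the output.
for ext, count in extension_histogram(["A.class", "B.class", "logo.png", ".text", "Makefile"]):
    print(ext, count)
```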

This is a very interesting statistic: it tells us that a substantial number of records relate to .class files, i.e. compiled Java files. This is most likely a result of automated processing where every single .JAR file is unzipped and every embedded .class file is accounted for. There are tons of non-executables as well (media files, source code, etc.). They are the majority, really.

Looking specifically for executable / installer files we quickly realize that these are indeed quite scarce:

dll    3658422
o    1749951
bin    557524
so    504442
exe    346292
sys    98390
bat    31482
msi    24763
ps1    13433
cmd    10931
scr    8567
drv    5350

File extensions tell us roughly what file types we deal with, but another great way to look at the NSRL set is statistical analysis of its file names. A quick & dirty histogram I came up with looks like this (top 30):

1221091 ".text"
744897 "1"
719540 "text" 
641856 ".reloc" 
630123 "__bitcode" 
579898 "version.txt" 
530652 ".data" 
457126 "__compact_unwind" 
393021 "__gcc_except_tab" 
387403 "__eh_frame" 
312569 ".rodata" 
312238 "CERTIFICATE" 
265511 "__init__.py"
235670 ".rdata"
230514 "Makefile"
214357 ".dynamic"
212477 ".dynsym"
208787 "pathname"
208787 "asset.meta"
208594 ".dynstr"
201074 ".eh_frame"
198758 ".symtab"
194931 ".init"
194760 ".note.gnu.build-id"
194215 ".strtab"
185339 "__const"
184099 "English.dat"
182945 ".gnu.version"
176370 "Master.js"
175495 ".gnu.version_r"
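
For the curious, a quick & dirty version of such a histogram could look like the sketch below. It assumes NSRLFile.txt is a CSV file with a header row that contains a FileName column; the sample rows are made up (the hashes are placeholders), and filename_histogram is just a name I picked:

```python
import csv
import io
from collections import Counter


def filename_histogram(lines, top=30):
    """Count values of the FileName column and return the `top` most common ones."""
    reader = csv.DictReader(lines)
    return Counter(row["FileName"] for row in reader).most_common(top)


# Made-up sample in the assumed column layout; hashes are placeholders.
SAMPLE = io.StringIO(
    '"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"\n'
    '"A1","B1","C1",".text",4096,1,"358",""\n'
    '"A2","B2","C2",".text",8192,2,"358",""\n'
    '"A3","B3","C3","version.txt",12,2,"358",""\n'
)

for name, count in filename_histogram(SAMPLE):
    print(count, name)

# Against the real file (encoding handling is an assumption):
# with open("NSRLFile.txt", encoding="utf-8", errors="replace", newline="") as f:
#     print(filename_histogram(f))
```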

This is yet another interesting statistic that tells us that what we believed to be just a large file hashset is actually a mix of file hashes and hashes of sections of executable files. These are useful in malware analysis, but not so much in a forensic time-saving exercise that relies on excluding files with known hashes. In other words, if you blindly use the NSRL hashset on your forensic images, you are spending CPU cycles on comparisons that, by the very nature of the NSRL data set, are wasted. The only set applied to the file system should be hashes of actual files, not their chunks.

The third lesson is pretty obvious: if you plan on using the NSRL hashset, it’s best to have different sets for different operating systems, and for that we can leverage the OpSystemCode field (use an OS-specific subset for the target image). Oh, but not so fast. There are 1300+ OS versions listed inside the NSRLOS.txt file. The file includes Amstrad, MSDOS, Novell Netware, NextStep, AIX and many other kinda derelict platforms. If you plan on using NSRL, it’s probably good to simply exclude files belonging to these ancient systems first.
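
Building such an OS-specific subset could be sketched as follows. I am assuming NSRLOS.txt is a CSV with OpSystemCode and OpSystemName columns and that NSRLFile.txt has SHA-1 and OpSystemCode columns; the function names, the keyword matching and the sample rows are all mine and made up for illustration:

```python
import csv
import io


def os_codes_matching(os_lines, keywords):
    """Return OpSystemCode values whose OpSystemName contains any keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    codes = set()
    for row in csv.DictReader(os_lines):
        name = row["OpSystemName"].lower()
        if any(k in name for k in wanted):
            codes.add(row["OpSystemCode"])
    return codes


def hashes_for_os(file_lines, codes):
    """Return the SHA-1 values of entries whose OpSystemCode is in `codes`."""
    return {row["SHA-1"] for row in csv.DictReader(file_lines)
            if row["OpSystemCode"] in codes}


# Made-up samples in the assumed layouts; codes and hashes are placeholders.
OS_SAMPLE = (
    '"OpSystemCode","OpSystemName","OpSystemVersion","MfgCode"\n'
    '"1","Windows 10","10","100"\n'
    '"2","Amstrad CPC","1","200"\n'
)
FILE_SAMPLE = (
    '"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"\n'
    '"A1","B1","C1","kernel32.dll",1000,1,"1",""\n'
    '"A2","B2","C2","GAME.BAS",200,2,"2",""\n'
)

windows_codes = os_codes_matching(io.StringIO(OS_SAMPLE), ["windows"])
print(hashes_for_os(io.StringIO(FILE_SAMPLE), windows_codes))   # {'A1'}
```

The same two functions work the other way round: match the derelict platforms by keyword and drop their codes instead of keeping them.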

Let’s be honest – there is a lot of value in the NSRL hashset, but there is always more than just one “but” and I have listed a few. This is not to discourage using it, but to be a bit more choosy about how we use it, and to selectively cherry-pick subsets of its data for time-sensitive analysis.