Analysing NSRL data set for fun and because… curious

Last year I took a very quick look at the NSRL hash set. Being the de facto gold standard of known-good hashes, I was curious what sort of data is actually included in it. There is no better way of finding out than actually looking at it, so I downloaded the files and started some basic analysis.

The set is distributed as a 26G file, NSRLFile.txt, and a quick line count shows how many entries it holds:

192677750 NSRLFile.txt

Yay, that’s a lot!

Now that we know how many entries we have in it, let’s try to see what sort of file extensions we can find inside the file. After parsing the data set I came up with the following results (top 30):

class 44776614
<no ext> 36040752
png 8968329
manifest 6258021
js 4846976
foliage 4513590
py 3938119
dll 3658422
java 3263630
nasl 3256040
html 3068834
h 2671295
xml 2582138
cat 2355352
htm 2197584
mum 2164820
txt 1906815
uasset 1859095
dat 1779063
o 1749951
c 1697767
gz 1691747
mui 1377019
svg 1018434
properties 1004380
mo 1000680
gif 973653
ogg 842367
dds 822591
upk 772181
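The tally above can be reproduced with something along these lines; a minimal sketch, assuming the RDS CSV layout where FileName is the fourth column (the tiny inline sample below is made up for illustration):

```python
import csv
from collections import Counter

def extension_histogram(lines):
    """Tally file extensions from NSRL RDS rows (FileName is the 4th column)."""
    counts = Counter()
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    for row in reader:
        name = row[3]
        # Take everything after the last dot; treat leading-dot-only names
        # (".text") and dotless names ("README") as "<no ext>"
        if "." in name.strip("."):
            counts[name.rsplit(".", 1)[1].lower()] += 1
        else:
            counts["<no ext>"] += 1
    return counts

# Made-up sample rows in the RDS CSV layout (hashes shortened for readability)
sample = [
    '"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"',
    '"A1","B1","C1","Foo.class","123","1","362",""',
    '"A2","B2","C2","README","456","1","362",""',
    '"A3","B3","C3","icon.PNG","789","1","362",""',
]
for ext, count in extension_histogram(sample).most_common():
    print(ext, count)
```

On the real 26G file you would stream the rows rather than load them, but the counting logic stays the same.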

This is a very interesting statistic: it tells us that a substantial number of records relate to .class files, i.e. compiled Java files. This is most likely the result of automated processing where every single .JAR file is unzipped and every embedded .class file is accounted for. There are tons of non-executables as well (media files, source code, etc.); in fact, they make up the majority of the set.

Looking specifically for executable / installer files we quickly realize that these are indeed quite scarce:

dll    3658422
o    1749951
bin    557524
so    504442
exe    346292
sys    98390
bat    31482
msi    24763
ps1    13433
cmd    10931
scr    8567
drv    5350

File extensions tell us roughly what file types we are dealing with, but another great way to look at the NSRL set is a statistical analysis of its file names. A quick & dirty histogram I came up with looks like this (top 30):

1221091 ".text"
744897 "1"
719540 "text" 
641856 ".reloc" 
630123 "__bitcode" 
579898 "version.txt" 
530652 ".data" 
457126 "__compact_unwind" 
393021 "__gcc_except_tab" 
387403 "__eh_frame" 
312569 ".rodata" 
265511 ""
235670 ".rdata"
230514 "Makefile"
214357 ".dynamic"
212477 ".dynsym"
208787 "pathname"
208787 "asset.meta"
208594 ".dynstr"
201074 ".eh_frame"
198758 ".symtab"
194931 ".init"
194760 ""
194215 ".strtab"
185339 "__const"
184099 "English.dat"
182945 ".gnu.version"
176370 "Master.js"
175495 ".gnu.version_r"
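The counting itself is trivial; a minimal sketch, again assuming the RDS CSV layout with FileName in the fourth column (sample rows are made up for illustration):

```python
import csv
from collections import Counter

def filename_histogram(lines, top=30):
    """Count identical FileName values across NSRL RDS rows."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    return Counter(row[3] for row in reader).most_common(top)

rows = [
    '"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"',
    '"A1","B1","C1",".text","10","1","362",""',
    '"A2","B2","C2",".text","20","1","362",""',
    '"A3","B3","C3","version.txt","30","1","362",""',
]
for name, count in filename_histogram(rows):
    print(f'{count} "{name}"')
```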

This is yet another interesting statistic: it tells us that what we believed to be just a large file hash set is actually a mix of hashes of whole files and hashes of sections of executable files. Section hashes are useful in malware analysis, but not so much in a forensic time-saving exercise that relies on excluding files with known hashes. In other words, if you blindly use the NSRL hash set on your forensic images, you are wasting CPU cycles testing correlations that, by the sole nature of the NSRL data set, can never match. The only set applied to the file system should be hashes of actual files, not of their chunks.
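One way to prune those entries before building an exclusion set is to drop rows whose FileName looks like a PE/ELF section or Mach-O segment name. This is my own heuristic, not anything NSRL documents, and it would also drop legitimate Unix dotfiles, so treat it strictly as a sketch:

```python
import csv

def looks_like_section(name):
    # Heuristic (my assumption, not an NSRL field): PE/ELF section names
    # start with "." (.text, .reloc), Mach-O segments with "__" (__bitcode)
    return name.startswith(".") or name.startswith("__")

def whole_file_rows(lines):
    """Yield only RDS rows that plausibly describe whole files."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    for row in reader:
        if not looks_like_section(row[3]):
            yield row

# Made-up sample rows for illustration only
rows = [
    '"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"',
    '"A1","B1","C1",".text","10","1","362",""',
    '"A2","B2","C2","kernel32.dll","20","1","362",""',
]
print([row[3] for row in whole_file_rows(rows)])  # → ['kernel32.dll']
```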

The third lesson is pretty obvious: if you plan on using the NSRL hash set, it's best to have different sets for different operating systems, and for that we can leverage the OpSystemCode field (use an OS-specific subset for the target image). Oh, but not so fast. There are 1300+ OS versions listed inside the NSRLOS.txt file, including Amstrad, MSDOS, Novell Netware, NextStep, AIX and many other kinda derelict platforms. If you plan on using NSRL, it's probably good to simply exclude files belonging to these ancient systems first.
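A sketch of how such OS-based subsetting could look, assuming the RDS layouts (NSRLOS.txt columns: OpSystemCode, OpSystemName, OpSystemVersion, MfgCode; OpSystemCode in the seventh column of NSRLFile.txt). The sample codes and names below are made up for illustration:

```python
import csv

def os_codes_matching(os_lines, keyword):
    """Collect OpSystemCode values whose OpSystemName contains keyword."""
    reader = csv.reader(os_lines)
    next(reader)  # skip the header row
    return {row[0] for row in reader if keyword.lower() in row[1].lower()}

def subset_by_os(file_lines, codes):
    """Keep only NSRLFile rows whose OpSystemCode falls in codes."""
    reader = csv.reader(file_lines)
    next(reader)  # skip the header row
    return [row for row in reader if row[6] in codes]

# Made-up sample rows for illustration only
os_rows = [
    '"OpSystemCode","OpSystemName","OpSystemVersion","MfgCode"',
    '"362","Windows 7","7","608"',
    '"147","MSDOS","6.22","519"',
]
file_rows = [
    '"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"',
    '"A1","B1","C1","kernel32.dll","10","1","362",""',
    '"A2","B2","C2","COMMAND.COM","20","2","147",""',
]
windows = os_codes_matching(os_rows, "windows")
print([row[3] for row in subset_by_os(file_rows, windows)])  # → ['kernel32.dll']
```

Inverting the match (collecting the codes of the derelict platforms and excluding them) works the same way.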

Let’s be honest – there is a lot of value in the NSRL hash set, but there is always more than just one “but”, and I have listed a few. This is not to discourage anyone from using it, but to encourage us to be a bit more choosy about how we use it and to selectively cherry-pick subsets of its data for time-sensitive analysis.