{"id":7968,"date":"2022-02-04T22:45:30","date_gmt":"2022-02-04T22:45:30","guid":{"rendered":"https:\/\/www.hexacorn.com\/blog\/?p=7968"},"modified":"2022-02-04T22:54:47","modified_gmt":"2022-02-04T22:54:47","slug":"analysing-nsrl-data-set-for-fun-and-because-curious","status":"publish","type":"post","link":"https:\/\/www.hexacorn.com\/blog\/2022\/02\/04\/analysing-nsrl-data-set-for-fun-and-because-curious\/","title":{"rendered":"Analysing NSRL data set for fun and because&#8230; curious"},"content":{"rendered":"\n<p>Last year I took a very quick look at the NSRL hash set. Since it is the de facto gold standard of <em>good hashes<\/em>, I was curious what sort of data it actually includes. There is no better way of looking at it than actually looking at it, so I downloaded the files and started some basic analysis.<\/p>\n\n\n\n<p>The <em>NSRLFile.txt.zip<\/em> archive stores a 26G file, <em>NSRLFile.txt<\/em>, that includes the following number of entries:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">192677750 NSRLFile.txt<\/pre>\n\n\n\n<p>Yay, that&#8217;s a lot!<\/p>\n\n\n\n<p>Now that we know how many entries it has, let&#8217;s try to see what sort of file extensions we can find inside the file. After parsing the data set I came up with the following results (top 30):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">class 44776614\n&lt;no ext&gt; 36040752\npng 8968329\nmanifest 6258021\njs 4846976\nfoliage 4513590\npy 3938119\ndll 3658422\njava 3263630\nnasl 3256040\nhtml 3068834\nh 2671295\nxml 2582138\ncat 2355352\nhtm 2197584\nmum 2164820\ntxt 1906815\nuasset 1859095\ndat 1779063\no 1749951\nc 1697767\ngz 1691747\nmui 1377019\nsvg 1018434\nproperties 1004380\nmo 1000680\ngif 973653\nogg 842367\ndds 822591\nupk 772181<\/pre>\n\n\n\n<p>This is a very interesting statistic: it tells us that a substantial number of records relate to <em>.class<\/em> files, that is, compiled Java files. 
This is most likely the result of automated processing in which every single .JAR file is unzipped and every embedded <em>.class<\/em> file is accounted for. There are tons of non-executables as well (media files, source code, etc.). They are the majority, really.<\/p>\n\n\n\n<p>Looking specifically for executable \/ installer files, we quickly realize that these are indeed quite scarce:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">dll    3658422\no    1749951\nbin    557524\nso    504442\nexe    346292\nsys    98390\nmsi    24763\nbat    31482\nps1    13433\ncmd    10931\nscr    8567\ndrv    5350<\/pre>\n\n\n\n<p>File extensions tell us roughly what file types we deal with, but another great way to look at the NSRL set is statistical analysis of its file names. A quick &amp; dirty histogram I came up with looks like this (top 30):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">1221091 \".text\"\n744897 \"1\"\n719540 \"text\" \n641856 \".reloc\" \n630123 \"__bitcode\" \n579898 \"version.txt\" \n530652 \".data\" \n457126 \"__compact_unwind\" \n393021 \"__gcc_except_tab\" \n387403 \"__eh_frame\" \n312569 \".rodata\" \n312238 \"CERTIFICATE\" \n265511 \"__init.py\"\n235670 \".rdata\"\n230514 \"Makefile\"\n214357 \".dynamic\"\n212477 \".dynsym\"\n208787 \"pathname\"\n208787 \"asset.meta\"\n208594 \".dynstr\"\n201074 \".eh_frame\"\n198758 \".symtab\"\n194931 \".init\"\n194760 \".note.gnu.build-id\"\n194215 \".strtab\"\n185339 \"__const\"\n184099 \"English.dat\"\n182945 \".gnu.version\"\n176370 \"Master.js\"\n175495 \".gnu.version_r\"<\/pre>\n\n\n\n<p>This is yet another interesting statistic: it tells us that what we believed to be just a large file hashset is actually a mix of file hashes and hashes of sections of executable files. The latter are useful in malware analysis, but not so much in a forensic time-saving exercise that relies on excluding files with known hashes. 
In other words, if you blindly apply the NSRL hashset to your forensic images, you are wasting CPU cycles on comparisons against entries that, by the very nature of the NSRL data set, are not hashes of whole files. The only set applied to the file system should be hashes of actual files, not their chunks.<\/p>\n\n\n\n<p>The third lesson is pretty obvious: if you plan on using the NSRL hashset, it&#8217;s best to have different sets for different operating systems, and for that we can leverage the OpSystemCode field (use an OS-specific subset for the target image). Oh, but not so fast. There are 1300+ OS versions listed inside the <em>NSRLOS.txt<\/em> file. The <a href=\"https:\/\/hexacorn.com\/d\/NSRLOS-2.txt\">file<\/a> includes Amstrad, MSDOS, Novell Netware, NextStep, AIX and many other kinda derelict platforms. If you plan on using NSRL, it&#8217;s probably good to simply exclude files belonging to these ancient systems first.<\/p>\n\n\n\n<p>Let&#8217;s be honest: there is a lot of value in the NSRL hashset, but there is always more than just one &#8220;but&#8221;, and I have listed a few. This is not to discourage anyone from using it, but to encourage us to be a bit more choosy about how we use it and to cherry-pick subsets of its data for time-sensitive analysis.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Last year I took a very quick look at the NSRL hash set. Since it is the de facto gold standard of good hashes, I was curious what sort of data it actually includes. 
There is no better way of looking at &hellip; <a href=\"https:\/\/www.hexacorn.com\/blog\/2022\/02\/04\/analysing-nsrl-data-set-for-fun-and-because-curious\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[19,100],"tags":[],"_links":{"self":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/7968"}],"collection":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/comments?post=7968"}],"version-history":[{"count":7,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/7968\/revisions"}],"predecessor-version":[{"id":7976,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/7968\/revisions\/7976"}],"wp:attachment":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/media?parent=7968"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/categories?post=7968"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/tags?post=7968"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}