{"id":8715,"date":"2023-09-16T22:11:55","date_gmt":"2023-09-16T22:11:55","guid":{"rendered":"https:\/\/www.hexacorn.com\/blog\/?p=8715"},"modified":"2023-09-16T22:21:00","modified_gmt":"2023-09-16T22:21:00","slug":"analysing-nsrl-data-set-for-fun-and-because-curious-part-3","status":"publish","type":"post","link":"https:\/\/www.hexacorn.com\/blog\/2023\/09\/16\/analysing-nsrl-data-set-for-fun-and-because-curious-part-3\/","title":{"rendered":"Analysing NSRL data set for fun and because\u2026 curious, Part 3"},"content":{"rendered":"\n<p>Nearly two years ago I published a <a href=\"https:\/\/www.hexacorn.com\/blog\/2022\/02\/04\/analysing-nsrl-data-set-for-fun-and-because-curious\/\" data-type=\"post\" data-id=\"7968\">quick<\/a> <a href=\"https:\/\/www.hexacorn.com\/blog\/2022\/02\/06\/analysing-nsrl-data-set-for-fun-and-because-curious-part-2\/\" data-type=\"post\" data-id=\"7979\">summary<\/a> of my analysis of <a href=\"https:\/\/www.nist.gov\/itl\/ssd\/software-quality-group\/national-software-reference-library-nsrl\">NSRL<\/a> data. I believe I was the first one to publicly evaluate this data set, and today I still stand by the harsh conclusions I reached back then. And what makes me really happy about that two-year-old analysis is the small ripple effect that my posts caused&#8230;<\/p>\n\n\n\n<p>I really loved this <a href=\"https:\/\/twitter.com\/dfirscience\">DFIR science<\/a> follow-up <a href=\"https:\/\/dfir.science\/2022\/02\/A-more-efficient-NSRL-for-digital-forensics\">post<\/a> &#8211; not only did Joshua follow my steps and deliver some nice data crunching on the NSRL core dataset to confirm or disprove my findings and hypotheses &#8211; he also did some actual benchmarking! I think the results of his experiment prove beyond any doubt that when you blindly do <em>garbage in<\/em>, you will for sure get <em>garbage out<\/em>. Also known as: you can use NSRL data better. 
And then Joshua published his <a href=\"https:\/\/github.com\/DFIRScience\/Efficient-NSRL\">Efficient-NSRL<\/a> tool as well. So, if you use the NSRL set in your investigations, you will benefit from taking a look at my older posts, Joshua&#8217;s post, and his Efficient-NSRL tool&#8230;<\/p>\n\n\n\n<p>Two years later&#8230; <\/p>\n\n\n\n<p>The NSRL data set has changed a lot since 2021, so it&#8217;s only natural to come back to its recent incarnation to see what has changed&#8230;<\/p>\n\n\n\n<p>The first notable change is that the NSRL data is now distributed only as a SQLite3 database. The schema of the database is available and you can find it inside files named like this:<\/p>\n\n\n\n<ul>\n<li>RDS_2023.03.1_modern.schema.sql<\/li>\n\n\n\n<li>RDS_2023.06.1_modern_minimal.schema.sql<\/li>\n<\/ul>\n\n\n\n<p>To create a textual equivalent of the old <em>NSRLFile.txt<\/em> file one has to follow the recipe provided inside <a href=\"https:\/\/s3.amazonaws.com\/rds.nsrl.nist.gov\/RDS\/RDSv3_Docs\/RDSv3_to_RDSv2_text_files.pdf\">this PDF<\/a>. Which, of course, doesn&#8217;t work, because the already-present FILE view (inside the <em>RDS_2023.03.1_modern.db<\/em>) does not include the crc32 column\/field&#8230; but we can fix that easily. 
We just create a new VIEW called FILE2 that includes the missing CRC32 column\/field:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">CREATE VIEW FILE2 AS\n    SELECT\n        UPPER(md.sha256) AS sha256,\n        UPPER(md.sha1) AS sha1,\n        UPPER(md.md5) AS md5,\n        UPPER(md.crc32) AS crc32,\n        CASE md.extension\n            WHEN '' THEN md.file_name\n            ELSE md.file_name||'.'||md.extension\n        END AS file_name,\n        md.bytes AS file_size,\n        po.package_id\n    FROM\n        METADATA AS md,\n        PACKAGE_OBJECT AS po\n    WHERE\n        md.object_id = po.object_id<\/pre>\n\n\n\n<p>and then we run the export query using the FILE2 view:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">DROP TABLE IF EXISTS EXPORT;\nCREATE TABLE EXPORT AS SELECT sha1, md5, crc32, file_name, file_size, package_id FROM FILE2;\nUPDATE EXPORT SET file_name = REPLACE(file_name, '\"', '');\n.mode csv\n.headers off\n.output output.txt\nSELECT '\"' || sha1 || '\"', '\"' || md5 || '\"', '\"' || crc32 || '\"', '\"' || file_name || '\"', file_size,\npackage_id, '\"' || 0 || '\"', '\"' || '\"' FROM EXPORT ORDER BY sha1;<\/pre>\n\n\n\n<p>or, if we just want file names:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">.output filenames.txt\nSELECT file_name FROM EXPORT;<\/pre>\n\n\n\n<p>These filenames can then be sorted, counted, etc.<\/p>\n\n\n\n<p>There are a lot more file names in the new set, that&#8217;s for sure. It went from the 16512841 unique file names I observed in the 2021 set to 23676133 in Jan 2023. Still, much of it is not that useful, because the actual benign (&#8216;good&#8217;) source files are being pushed around, their logical chunks carved out, their sections and class files extracted, etc. 
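If you prefer to script the fix and the counting in one go, here is a minimal sketch using Python&#8217;s sqlite3 module. It assumes the same schema the queries above rely on (METADATA and PACKAGE_OBJECT joined on object_id); the explicit JOIN is equivalent to the comma join in the view definition, and the database file name is the one from the NSRL distribution.

```python
import sqlite3
from collections import Counter

def top_filenames(con, n=10):
    """Create the FILE2 view (if missing) and tally file names.

    A sketch assuming the RDSv3 'modern' schema used above:
    METADATA and PACKAGE_OBJECT joined on object_id.
    """
    con.executescript("""
        CREATE VIEW IF NOT EXISTS FILE2 AS
            SELECT
                UPPER(md.sha256) AS sha256,
                UPPER(md.sha1)   AS sha1,
                UPPER(md.md5)    AS md5,
                UPPER(md.crc32)  AS crc32,
                CASE md.extension
                    WHEN '' THEN md.file_name
                    ELSE md.file_name || '.' || md.extension
                END AS file_name,
                md.bytes AS file_size,
                po.package_id
            FROM METADATA AS md
            JOIN PACKAGE_OBJECT AS po ON md.object_id = po.object_id;
    """)
    # Count every file name coming out of the view and return the top n.
    counts = Counter(name for (name,) in con.execute("SELECT file_name FROM FILE2"))
    return counts.most_common(n)

# Against the real data set this would be driven like:
# con = sqlite3.connect("RDS_2023.03.1_modern.db")
# for name, count in top_filenames(con, 20):
#     print(count, name)
```

This reproduces the frequency list shown below without the intermediate EXPORT table or text file.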
Same as before, the most frequent &#8216;file names&#8217; are PE file section names, MSI table names, Java files, etc&#8230; And if you missed the memo, hashes of these logical &#8216;chunks&#8217; are not very useful, as you will never find their binary equivalents on any file system, except for the &#8216;worker&#8217; NSRL system(s). Unless your forensic suite can apply hashes to PE file sections, MSI tables, or .jar class files, all these &#8216;partial&#8217; hashes are useless when it comes to marking a file as a &#8216;good&#8217;, NSRL-known file.<\/p>\n\n\n\n<p>The stats for the top file names are now as follows (for RDS_2023.03.1_modern.db):<\/p>\n\n\n\n<ul>\n<li>9081226 1<\/li>\n\n\n\n<li>7850139 .text<\/li>\n\n\n\n<li>5933107 .reloc<\/li>\n\n\n\n<li>5086051 .data<\/li>\n\n\n\n<li>3634652 version.txt<\/li>\n\n\n\n<li>3101066 .rdata<\/li>\n\n\n\n<li>2923472 CERTIFICATE<\/li>\n\n\n\n<li>2784502 __LINKEDIT<\/li>\n\n\n\n<li>2784113 __TEXT__text<\/li>\n\n\n\n<li>2758779 __TEXT__cstring<\/li>\n\n\n\n<li>2735742 __DATA__data<\/li>\n\n\n\n<li>2718505 __DATA__bss<\/li>\n\n\n\n<li>2667173 __TEXT__const<\/li>\n\n\n\n<li>2629651 __DATA__const<\/li>\n\n\n\n<li>2588460 __DATA__common<\/li>\n\n\n\n<li>2437056 __DATA__mod_init_func<\/li>\n\n\n\n<li>2187040 __DATA__mod_term_func<\/li>\n\n\n\n<li>2164991 __DWARF__debug_abbrev<\/li>\n\n\n\n<li>2164534 __DWARF__debug_line<\/li>\n\n\n\n<li>2164534 __DWARF__debug_info<\/li>\n\n\n\n<li>2164532 __DWARF__debug_aranges<\/li>\n\n\n\n<li>2163269 __DWARF__debug_pubnames<\/li>\n\n\n\n<li>2163268 __DWARF__debug_pubtypes<\/li>\n\n\n\n<li>2162599 __DWARF__debug_str<\/li>\n\n\n\n<li>2162336 __DWARF__debug_frame<\/li>\n\n\n\n<li>2161990 __DWARF__debug_loc<\/li>\n\n\n\n<li>2161722 __DWARF__debug_ranges<\/li>\n\n\n\n<li>2159803 __DWARF__apple_objc<\/li>\n\n\n\n<li>2159803 __DWARF__apple_namespac<\/li>\n\n\n\n<li>2159800 __DWARF__apple_types<\/li>\n\n\n\n<li>2159800 __DWARF__apple_names<\/li>\n\n\n\n<li>2158643 
__DWARF__debug_inlined<\/li>\n\n\n\n<li>2157348 __HIB__common<\/li>\n\n\n\n<li>2157348 __HIB__bss<\/li>\n\n\n\n<li>2157347 __KLD__bss<\/li>\n\n\n\n<li>2157346 __HIB__const<\/li>\n\n\n\n<li>2157345 __KLD__cstring<\/li>\n<\/ul>\n\n\n\n<p>We must admit that it&#8217;s hardly useful.<\/p>\n\n\n\n<p>Having said that, you may be surprised that I still like this dataset a lot, and would still recommend using the NSRL set in your investigations, even if you use it blindly. Yes, it&#8217;s not ideal, it may cost your forensic boxes some extra CPU cycles, but it&#8217;s at least something. And it&#8217;s out there, for free. I also respect the effort a lot, because a few years ago I made a conscious decision to create a data set to compete with NSRL, and now I know how hard it is&#8230;<\/p>\n\n\n\n<p>The bottom line is: know and use all available data sets and tools. Just apply them wisely.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Nearly two years ago I published a quick summary of my analysis of NSRL data. 
I believe I was the first one to publicly evaluate this data set, and I still stand by the harsh conclusions I reached back then, &hellip; <a href=\"https:\/\/www.hexacorn.com\/blog\/2023\/09\/16\/analysing-nsrl-data-set-for-fun-and-because-curious-part-3\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[19,100],"tags":[],"_links":{"self":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/8715"}],"collection":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/comments?post=8715"}],"version-history":[{"count":11,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/8715\/revisions"}],"predecessor-version":[{"id":8733,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/8715\/revisions\/8733"}],"wp:attachment":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/media?parent=8715"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/categories?post=8715"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/tags?post=8715"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}