Analysing NSRL data set for fun and becauseā€¦ curious, Part 3

Nearly two years ago I published a quick summary of my analysis of NSRL data. I believe I was the first one to publicly evaluate this data set, and I still stand by the harsh conclusions I reached back then, today. And what makes me really happy about that 2 year old analysis is a small ripple effect that my posts caused…

I really loved this DFIR science follow-up post – not only Joshua followed my steps and delivered some nice data crunching on the NSRL core dataset to confirm/disprove my findings and hypothesis – he also did some actual benchmarking! I think the results of his experiment prove beyond any doubt that when you blindly do garbage in, there will for sure be garbage out. Also known as: you can use NSRL data better. And then Joshua published his Efficient-NSRL tool as well. So, if you use NSRL set in your investigations, you will benefit from taking a look at my older posts, Joshua’s post, and his Efficient-NSRL tool…

Two years later…

The NSRL data set has changed a lot since 2021, so it’s only natural to come back to its recent incarnation to see what has changed…

The first notable change is that the NSRL data is now distributed as a SQLite3 database only. The schema of the database is available and you can find it inside files named like this:

  • RDS_2023.03.1_modern.schema.sql
  • RDS_2023.06.1_modern_minimal.schema.sql

To create a textual equivalent of the old NSRLFile.txt file one has to follow the recipe provided inside this PDF. Which, of course doesn’t work, because the already-present FILE view (inside the RDS_2023.03.1_modern.db) does not include the crc32 column/field… but we can fix that easily. We just create a new VIEW called FILE2 that includes that missing CRC32 column/field:

CREATE VIEW FILE2 AS
    SELECT
        UPPER(md.sha256) AS sha256,
        UPPER(md.sha1) AS sha1,
        UPPER(md.md5) AS md5,
        UPPER(md.crc32) AS crc32,
        CASE md.extension
        WHEN ''
                THEN md.file_name
                ELSE md.file_name||'.'||md.extension
            END AS file_name,
        md.bytes AS file_size,
        po.package_id
    FROM
        METADATA AS md,
        PACKAGE_OBJECT AS po
    WHERE
        md.object_id = po.object_id

and then we run the export query using a FILE2 view:

DROP TABLE IF EXISTS EXPORT;
CREATE TABLE EXPORT AS SELECT sha1, md5, crc32, file_name, file_size, package_id FROM FILE2;
UPDATE EXPORT SET file_name = REPLACE(file_name, '"', '');
.mode csv
.headers off
.output output.txt
SELECT '"' || sha1 || '"', '"' || md5 || '"', '"' || crc32 || '"', '"' || file_name || '"', file_size,
package_id, '"' || 0 || '"', '"' || '"' FROM EXPORT ORDER BY sha1;

or, if we just want file names:

.output filenames.txt
SELECT file_name FROM EXPORT;

These filenames can be then sorted, counted, etc.

There is a lot more file names in the new set, that’s for sure. It went from 16512841 unique file names I observed in a 2021 set to 23676133 in Jan 2023. Still, lots of it is not that useful, because the actual benign (‘good’) source files are being pushed around, their logical chunks carved out, their sections and class files extracted, etc. – same as before, the most frequent ‘file names’ are PE file section names, MSI table names, Java files, etc… And if you missed the memo, hashes of these logical ‘chunks’ are not very useful as you will never find their binary equivalents present on any file system, except for the ‘worker’ NSRL system(s). Unless your forensic suite can apply hashes to PE file sections, MSI tables, .jar class files – all these ‘partial’ hashes are useless when it comes to ‘mark file as a good, NSRL known file’.

The stats for the top file names are now as follows (for RDS_2023.03.1_modern.db):

  • 9081226 1
  • 7850139 .text
  • 5933107 .reloc
  • 5086051 .data
  • 3634652 version.txt
  • 3101066 .rdata
  • 2923472 CERTIFICATE
  • 2784502 __LINKEDIT
  • 2784113 __TEXT__text
  • 2758779 __TEXT__cstring
  • 2735742 __DATA__data
  • 2718505 __DATA__bss
  • 2667173 __TEXT__const
  • 2629651 __DATA__const
  • 2588460 __DATA__common
  • 2437056 __DATA__mod_init_func
  • 2187040 __DATA__mod_term_func
  • 2164991 __DWARF__debug_abbrev
  • 2164534 __DWARF__debug_line
  • 2164534 __DWARF__debug_info
  • 2164532 __DWARF__debug_aranges
  • 2163269 __DWARF__debug_pubnames
  • 2163268 __DWARF__debug_pubtypes
  • 2162599 __DWARF__debug_str
  • 2162336 __DWARF__debug_frame
  • 2161990 __DWARF__debug_loc
  • 2161722 __DWARF__debug_ranges
  • 2159803 __DWARF__apple_objc
  • 2159803 __DWARF__apple_namespac
  • 2159800 __DWARF__apple_types
  • 2159800 __DWARF__apple_names
  • 2158643 __DWARF__debug_inlined
  • 2157348 __HIB__common
  • 2157348 __HIB__bss
  • 2157347 __KLD__bss
  • 2157346 __HIB__const
  • 2157345 __KLD__cstring

We must admit that it’s s hardly useful.

Having said that, you may be surprised that I still like this dataset a lot, and would still recommend using the NSRL set in your investigations, even if you use it blindly. Yes, it’s not ideal, it may cause your forensic boxes some extra CPU cycles, but it’s at least something. And it’s out there, for free. I also respect the efforts a lot, because a few years ago I made a conscious decision to create a competitive set to NSRL and now I do know now how hard it is…

The bottom line is: know and use all available data sets and tools. Just apply them wisely.

Analysing NSRL data set for fun and because… curious, Part 2

This is the second post discussing what we can find inside the NSRL data set.

At this stage we know it’s not only file hashes, but also sections of executables and java .class files stored inside JAR files. Digging a bit more in the file name statistics we find there is another subset of hashes that is quite substantial: MSI tables. They happen to be named in a very specific way i.e. with an exclamation point as a prefix. There are 350K entries like this:

35666 "!_StringData"
26126 "!_StringPool"
14844 "!File"
11991 "!Property"
11530 "!Error"
9987 "!Media"
9545 "!MsiFileHash"
9063 "!Feature"
9026 "!InstallExecuteSequence"
8949 "!Component"
8844 "!Registry"
8822 "!CustomAction"
8781 "!Directory"
8702 "!FeatureComponents"
7535 "!UIText"
7080 "!_Columns"
7009 "!_Tables"
6609 "!Binary"
5837 "!Control"
5540 "!AdvtExecuteSequence"
5528 "!Upgrade"
5500 "!_Validation"
5236 "!RegLocator"
5174 "!ActionText"
5085 "!InstallUISequence"
4843 "!AdminExecuteSequence"
4704 "!CreateFolder"
4638 "!Dialog"
4606 "!RadioButton"
4087 "!AppSearch"

While it may look like not a lot, if we exclude all file names that start with an exclamation mark, dot (a bit unfair, but a good estimate for section names), and .class files we drop the number of entries by nearly 30%:

192,677,750 - all
 57,390,395 - !<filename>, .<filename>, <filename>.class
135,287,355 - excluding !<filename>, .<filename>, <filename>.class

We could further narrow it down by excluding filenames starting with bracketed numbers f.ex. [5]SummaryInformation or underscores f.ex. __DATA__la_symbol_ptr. There is also a substantial number of filenames that are just numbers, numbers with media file extensions (f.ex. 1494.bmp) or are one way or another related to executable resources (manifest.txt, version.txt, CERTIFICATE, VERSION, etc.).

Another area of interest are files that will be most likely always uniquely bound to the NSRL test systems where they were generated and will never appear on other systems f.ex. files with the following properties. <filename>.pyc, <filename>.pyo, .gitignore.

Kudos to whoever is responsible for maintaining the NSRL set. It is an incredibly difficult task to build a list of good hashes. It’s tempting to unpack, decompile, debundle everything and sometimes this activity may just generate a little bit too much noise. I hope that unless I missed something, future versions of the set will include a flag for each entry to indicate whether the file is an embedded resource or a regular file system object. And another useful entry would be a parent. So that one could rebuild a tree of parent-child relationships leading it back to original ancestor file f.ex. if we start with a .msi file that includes a .zip file that includes a bunch of PE files, we could trace it back from the PE file back to .zip file and tgen to .msi and vice versa.

But hey… what does ‘file’ even mean today? Is a .class inside .JAR file a file system object? Or a resource hidden from the file system by the packager abstraction layer?