Analysing NSRL data set for fun and because… curious

Last year I took a very quick look at the NSRL hash set. As it is the de facto gold standard of known-good hashes, I was curious what sort of data is actually included in it. There is no better way of looking at it than actually looking at it, so I downloaded the files and started some basic analysis.

The NSRLFile.txt.zip archive stores a 26G file, NSRLFile.txt, that includes the following number of entries:

192677750 NSRLFile.txt

Yay, that’s a lot!

Now that we know how many entries we have in it, let’s try to see what sort of file extensions we can find inside the file. After parsing the data set I came up with the following results (top 30):

class 44776614
<no ext> 36040752
png 8968329
manifest 6258021
js 4846976
foliage 4513590
py 3938119
dll 3658422
java 3263630
nasl 3256040
html 3068834
h 2671295
xml 2582138
cat 2355352
htm 2197584
mum 2164820
txt 1906815
uasset 1859095
dat 1779063
o 1749951
c 1697767
gz 1691747
mui 1377019
svg 1018434
properties 1004380
mo 1000680
gif 973653
ogg 842367
dds 822591
upk 772181

This is a very interesting statistic: it tells us that a substantial number of records relate to .class files, i.e. compiled Java files. This is most likely a result of automated processing in which every single .JAR file is unzipped and every embedded .class file is accounted for. There are tons of non-executables as well (media files, source code, etc.). They are the majority, really.
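
A minimal sketch of how such an extension tally could be computed. It assumes that FileName is the fourth comma-separated field of each NSRLFile.txt record and that a name with only a leading dot (such as ".text") counts as having no extension; both are assumptions for illustration, and the sample records are made up, not real NSRL data.

```javascript
// Split one CSV line into fields, honouring double-quoted fields and "" escapes.
function parseCsvLine(line) {
  const fields = [];
  let field = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') { field += '"'; i++; }
      else if (ch === '"') { inQuotes = false; }
      else { field += ch; }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      fields.push(field);
      field = '';
    } else {
      field += ch;
    }
  }
  fields.push(field);
  return fields;
}

// Lower-cased text after the last dot, or '<no ext>' when the name has no dot,
// starts with its only dot, or ends with a dot (assumed convention).
function extensionOf(fileName) {
  const dot = fileName.lastIndexOf('.');
  if (dot <= 0 || dot === fileName.length - 1) return '<no ext>';
  return fileName.slice(dot + 1).toLowerCase();
}

// Count extensions over an iterable of record lines (header line excluded),
// returning [extension, count] pairs sorted by descending count.
function countExtensions(lines, fileNameColumn = 3) {
  const counts = new Map();
  for (const line of lines) {
    const ext = extensionOf(parseCsvLine(line)[fileNameColumn] ?? '');
    counts.set(ext, (counts.get(ext) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Made-up sample records for illustration only.
const sampleLines = [
  '"SHA1-A","MD5-A","CRC-A","Foo.class",100,1,"358",""',
  '"SHA1-B","MD5-B","CRC-B","Bar.CLASS",200,1,"358",""',
  '"SHA1-C","MD5-C","CRC-C","logo.png",300,2,"358",""',
  '"SHA1-D","MD5-D","CRC-D","Makefile",400,3,"358",""',
];
const extensionCounts = countExtensions(sampleLines);
console.log(extensionCounts);
```

For the real 26G file the lines would have to come from a streaming reader rather than an in-memory array.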

Looking specifically for executable / installer files we quickly realize that these are indeed quite scarce:

dll    3658422
o    1749951
bin    557524
so    504442
exe    346292
sys    98390
msi    24763
bat    31482
ps1    13433
cmd    10931
scr    8567
drv    5350

File extensions tell us roughly what file types we deal with, but another great way to look at the NSRL set is statistical analysis of its file names. A quick & dirty histogram I came up with looks like this (top 30):

1221091 ".text"
744897 "1"
719540 "text" 
641856 ".reloc" 
630123 "__bitcode" 
579898 "version.txt" 
530652 ".data" 
457126 "__compact_unwind" 
393021 "__gcc_except_tab" 
387403 "__eh_frame" 
312569 ".rodata" 
312238 "CERTIFICATE" 
265511 "__init.py"
235670 ".rdata"
230514 "Makefile"
214357 ".dynamic"
212477 ".dynsym"
208787 "pathname"
208787 "asset.meta"
208594 ".dynstr"
201074 ".eh_frame"
198758 ".symtab"
194931 ".init"
194760 ".note.gnu.build-id"
194215 ".strtab"
185339 "__const"
184099 "English.dat"
182945 ".gnu.version"
176370 "Master.js"
175495 ".gnu.version_r"

This is yet another interesting statistic. It tells us that what we believed to be just a large set of file hashes is actually a mix of file hashes and hashes of sections of executable files. The latter are useful in malware analysis, but not so much in a forensic time-saving exercise that relies on excluding files with known hashes. In other words, if you blindly apply the whole NSRL hashset to your forensic images, you are wasting CPU cycles on comparisons that, by the very nature of the NSRL data set, are unlikely to ever match. The only set applied to the file system should be hashes of actual files, not their chunks.
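
One rough way to set the section hashes aside is to partition records by name. The sketch below uses a few section names taken from the histogram above; the list is deliberately incomplete, the record shape ({ sha1, fileName }) is hypothetical, and a name-only heuristic can misclassify a real file that happens to share a section name.

```javascript
// Section-like names copied from the histogram above (not exhaustive).
const SECTION_LIKE_NAMES = new Set([
  '.text', '.reloc', '.data', '.rodata', '.rdata', '.dynamic', '.dynsym',
  '.dynstr', '.eh_frame', '.symtab', '.init', '.note.gnu.build-id',
  '.strtab', '.gnu.version', '.gnu.version_r',
  '__bitcode', '__compact_unwind', '__gcc_except_tab', '__eh_frame', '__const',
]);

// Split records into those that look like whole files and those that look
// like sections of executables, judged by name alone.
function partitionRecords(records) {
  const files = [];
  const sections = [];
  for (const record of records) {
    (SECTION_LIKE_NAMES.has(record.fileName) ? sections : files).push(record);
  }
  return { files, sections };
}

// Made-up sample records for illustration only.
const sampleRecords = [
  { sha1: 'hypothetical-1', fileName: 'notepad.exe' },
  { sha1: 'hypothetical-2', fileName: '.text' },
  { sha1: 'hypothetical-3', fileName: '__eh_frame' },
  { sha1: 'hypothetical-4', fileName: 'version.txt' },
];
const { files, sections } = partitionRecords(sampleRecords);
console.log(files.length, 'file hashes kept,', sections.length, 'section hashes set aside');
```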

The third lesson is pretty obvious: if you plan on using the NSRL hashset, it’s best to have different sets for different operating systems, and for that we can leverage the OpSystemCode field (use an OS-specific subset for the target image). Oh, but not so fast. There are 1300+ OS versions listed inside the NSRLOS.txt file. The file includes Amstrad, MSDOS, Novell Netware, NextStep, AIX and many other kinda derelict platforms. If you plan on using NSRL, it’s probably good to simply exclude files belonging to these ancient systems first.
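
A minimal sketch of building such an OS-specific subset. It assumes the NSRLOS.txt rows and the hash records have already been parsed into objects; the field names (code, name, opSystemCode) and all sample values are hypothetical and do not come from the real data set.

```javascript
// Collect the OpSystemCode values whose OS name matches the given pattern.
function selectOsCodes(osRows, namePattern) {
  const codes = new Set();
  for (const row of osRows) {
    if (namePattern.test(row.name)) codes.add(row.code);
  }
  return codes;
}

// Keep only the hash records that belong to one of the selected OS codes.
function filterByOs(records, codes) {
  return records.filter((record) => codes.has(record.opSystemCode));
}

// Made-up rows standing in for parsed NSRLOS.txt entries.
const osRows = [
  { code: 'A1', name: 'Hypothetical Windows 10' },
  { code: 'A2', name: 'Hypothetical Windows 7' },
  { code: 'B1', name: 'Hypothetical Amstrad OS' },
];

// Made-up hash records carrying an OpSystemCode.
const hashRecords = [
  { sha1: 'hypothetical-1', fileName: 'kernel32.dll', opSystemCode: 'A1' },
  { sha1: 'hypothetical-2', fileName: 'game.bin', opSystemCode: 'B1' },
  { sha1: 'hypothetical-3', fileName: 'user32.dll', opSystemCode: 'A2' },
];

const windowsCodes = selectOsCodes(osRows, /windows/i);
const windowsSubset = filterByOs(hashRecords, windowsCodes);
console.log(windowsSubset.map((record) => record.fileName));
```

The same two functions work in the other direction: select the derelict platforms by name and drop every record whose code is in that set.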

Let’s be honest: there is a lot of value in the NSRL hashset, but there is always more than just one “but”, and I have listed a few. This is not to discourage using it, but to encourage being a bit more choosy about how we use it, and cherry-picking subsets of its data for time-sensitive analysis.

Delphi API monitoring with Frida

This is just a simple proof of concept that can be extended to build a full-blown Delphi API Monitor.

Delphi lives in its own API ecosystem. Reversing Delphi applications requires a dedicated tool/decompiler (e.g. IDR) and FLIRT signatures, and most of this work relies on the DCU32INT decompiler. When I built my sandbox and wanted to add Delphi support, I created some mini-signatures for some of the more crucial Delphi APIs; any time a Delphi app was analyzed, I’d look for these code patterns, patch them with my API hook, and then observe the results (I described it here).

With the invention of new reversing tools we have an opportunity to re-visit this topic to rapidly produce a prototype of a Delphi API monitor that will be fast, robust and will cover most angles.

Before we begin, a couple of points:

  • Multiple versions of the same API exist:
    • it’s just a different binary encoding of the same functions that made it to Delphi DCUs f.ex. LStrFromString:
      • 870C245131C98A0A42E9xxxxxxxxC3
      • FF3424894C240431C98A0A42E9xxxxxxxxC3
    • you may also come across differences in API declarations e.g.:
      • LStrLen (const S: AnsiString)
      • LStrLen (const S: string)
  • Delphi APIs use a different calling convention, so we need to take it into account while writing Frida handlers: eax and edx are the registers Delphi uses to pass the first 2 arguments
  • Strings used by Delphi are encoded differently than in C; the most typical encoding stores the length of the string in the first byte, followed by the actual string (there are others)
  • The Frida hooking engine accepts both Win32 API module/API names and addresses; the addresses need to be provided as RVA offsets within the monitored module, f.ex. -a foo.exe!address
  • For each foo.exe!address, you need to create a respective handler called sub_<address>.js e.g. sub_34AF.js.
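
The wildcard signatures listed above (with xx standing for a byte that varies) can be matched against raw bytes along these lines. This is a minimal sketch that runs under Node rather than Frida; the function names are made up for illustration, and the sample buffer is fabricated around the first LStrFromString pattern.

```javascript
// Turn a hex signature such as "8A0A42E9xxxxxxxxC3" into an array of byte
// values, with null standing in for each wildcard "xx" pair.
function compileSignature(signature) {
  if (signature.length % 2 !== 0) {
    throw new Error('signature must contain whole bytes');
  }
  const pattern = [];
  for (let i = 0; i < signature.length; i += 2) {
    const pair = signature.slice(i, i + 2);
    if (/^xx$/i.test(pair)) {
      pattern.push(null);
    } else if (/^[0-9a-f]{2}$/i.test(pair)) {
      pattern.push(parseInt(pair, 16));
    } else {
      throw new Error('invalid signature byte: ' + pair);
    }
  }
  return pattern;
}

// Return every offset in `bytes` at which the compiled pattern matches.
function findSignature(bytes, pattern) {
  const hits = [];
  for (let offset = 0; offset + pattern.length <= bytes.length; offset++) {
    let matched = true;
    for (let j = 0; j < pattern.length; j++) {
      if (pattern[j] !== null && bytes[offset + j] !== pattern[j]) {
        matched = false;
        break;
      }
    }
    if (matched) hits.push(offset);
  }
  return hits;
}

// Fabricated buffer: two padding bytes, the first pattern with an arbitrary
// 4-byte jump displacement filled in, then one more padding byte.
const sampleBytes = Buffer.from(
  '9090' + '870C245131C98A0A42E9' + '11223344' + 'C3' + '90', 'hex');

const lstrFromStringA = compileSignature('870C245131C98A0A42E9xxxxxxxxC3');
const lstrFromStringB = compileSignature('FF3424894C240431C98A0A42E9xxxxxxxxC3');

console.log(findSignature(sampleBytes, lstrFromStringA));
console.log(findSignature(sampleBytes, lstrFromStringB));
```

The offsets returned here are offsets into the scanned buffer; when scanning a file on disk they still have to be converted to RVAs before being handed to frida-trace.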

With that, we just need to find an application for testing, and write our first handler.

The old Resource Hacker is written in Delphi. Using IDA we can quickly identify one of its comparison functions PStrCmp at address 0x004029E0 (RVA=29E0):

The example handler showing the calls to this API with parameters can look like this:

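{
  onEnter(log, args, state) {
    // Delphi passes the first two arguments in eax and edx. Each one points to
    // a length-prefixed string: one unsigned length byte, then the characters.
    const eaxLen = this.context.eax.readU8();
    const edxLen = this.context.edx.readU8();
    const eaxStr = this.context.eax.add(1).readUtf8String(eaxLen);
    const edxStr = this.context.edx.add(1).readUtf8String(edxLen);
    console.log(this.context.eip + ":" + eaxStr + " " + edxStr);
  },
  onLeave(log, retval, state) {
  }
}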
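
The string layout the handler relies on (one length byte followed by the characters) can be tried out without Frida. The sketch below decodes such a string from a Node Buffer; readShortString is a made-up helper name, and latin1 is an assumed stand-in for whatever ANSI code page the target actually uses. It also shows why the length byte should be read as unsigned.

```javascript
// Decode a length-prefixed Delphi string: one unsigned length byte at
// `offset`, followed by that many single-byte characters.
function readShortString(buffer, offset = 0) {
  const length = buffer.readUInt8(offset);
  return buffer.toString('latin1', offset + 1, offset + 1 + length);
}

// Fabricated memory: length 5, then "Hello", then unrelated trailing bytes.
const memory = Buffer.concat([
  Buffer.from([5]),
  Buffer.from('HelloJUNK', 'latin1'),
]);
const decoded = readShortString(memory);
console.log(decoded);

// A length byte above 127 turns negative when read as a signed value,
// which would break the subsequent string read.
const longLength = Buffer.from([200]);
const asUnsigned = longLength.readUInt8(0);
const asSigned = longLength.readInt8(0);
console.log(asUnsigned, asSigned);
```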

Now if we launch rsold.exe under frida-trace:

frida-trace c:\test\rsold.exe c:\windows\notepad.exe -a rsold.exe!2A64

which tells frida-tools to load the old Resource Hacker (rsold.exe), make it open the resources of c:\windows\notepad.exe, and add an API hook for PStrCmp (RVA=29E0 -> __handlers__\rsold.exe\sub_2a64.js), we get a result like this:

Now that we know what we can do with it, there are at least 2 different avenues we can follow:

  • Write an IDAPython script that will export handlers for a given binary and for our APIs of choice
  • Use DCU32INT and export code for functions of interest from as many Delphi/CodeGear/Embarcadero versions as possible, then convert them into regular expressions (or leverage YARA) and build signatures; find these signatures inside target Delphi PE files, convert the file offsets of matched hits to RVA offsets, and finally export handlers for all functions of interest (no need for IDA in this case)
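
The offset conversion step in the second avenue can be sketched as follows. The section table here is hypothetical (in practice it would be read from the PE headers of the target file), the function names are made up, and the handler name simply follows the sub_<address>.js convention mentioned earlier.

```javascript
// Map a raw file offset to an RVA using the PE section table.
// Each section: { name, virtualAddress, pointerToRawData, sizeOfRawData }.
// Returns null when the offset is not inside any section's raw data
// (e.g. it falls in the headers or in an overlay).
function fileOffsetToRva(sections, fileOffset) {
  for (const section of sections) {
    const start = section.pointerToRawData;
    const end = start + section.sizeOfRawData;
    if (fileOffset >= start && fileOffset < end) {
      return fileOffset - start + section.virtualAddress;
    }
  }
  return null;
}

// Build the handler file name for a given RVA, using lower-case hex.
function handlerFileName(rva) {
  return 'sub_' + rva.toString(16) + '.js';
}

// Hypothetical section table, not taken from a real binary.
const sectionTable = [
  { name: 'CODE', virtualAddress: 0x1000, pointerToRawData: 0x400, sizeOfRawData: 0x5000 },
  { name: 'DATA', virtualAddress: 0x6000, pointerToRawData: 0x5400, sizeOfRawData: 0x800 },
];

// A signature hit at file offset 0x1DE0 lands in CODE:
// 0x1DE0 - 0x400 + 0x1000 = 0x29E0.
const hitRva = fileOffsetToRva(sectionTable, 0x1de0);
const hitHandler = handlerFileName(hitRva);
console.log(hitRva.toString(16), hitHandler);
```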

What are interesting APIs to handle?

We could start with strings, as these are often great for understanding the inner workings of programs:

  • LStrCat
  • LStrFromPWChar
  • LStrFromPWCharLen
  • LStrCatN
  • LStrCat3
  • LStrSetLength
  • LStrFromPChar
  • LStrAsg
  • LStrCopy
  • LStrCmp
  • LStrLAsg
  • LStrInsert
  • LStrDelete
  • LStrArrayClr
  • LStrToPChar
  • LStrFromPCharLen
  • LStrClr
  • LStrFromWArray
  • LStrFromWStr
  • LStrFromArray
  • LStrFromChar
  • LStrFromWChar
  • LStrFromUStr
  • LStrAddRef
  • LStrToString
  • LStrFromString
  • LStrEqual
  • LStrPos
  • LStrLen
  • LStrFromLenStr
  • LStrOfChar

File operations are of interest as well f.ex.:

  • ChangeFileExt
  • CreateDir
  • DateTimeToFileDate
  • DeleteFile
  • DiskFree
  • DiskSize
  • ExpandFileName
  • ExpandUNCFileName
  • ExtractFileDir
  • ExtractFileDrive
  • ExtractFileExt
  • ExtractFileName
  • ExtractFilePath
  • FileAge
  • FileClose
  • FileCreate
  • FileDateToDateTime
  • FileExists
  • FileGetAttr
  • FileGetDate
  • FileOpen
  • FileRead
  • FileSearch
  • FileSeek
  • FileSetAttr
  • FileSetDate
  • FileWrite
  • FindClose
  • FindFirst
  • FindNext
  • GetCurrentDir
  • RemoveDir
  • RenameFile
  • SetCurrentDir