The art of artifact collection and hoarding for the sake of forensic exclusivity…

This post is going to blow your mind – I am going to demonstrate that piracy is good! (sometimes)

I like to challenge the forensic processes du jour. At least in my head.

Today we often use this forensic suite/tool, or that forensic script (or set of scripts) to read and process the forensic evidence in its native form: NTFS, extX, APFS file systems, OS folders and files of interest, Event Logs, Memory Dumps, Cloud logs, Server logs, any other available Telemetry, etc. and then we do some spelunking, bookmarking, filtering, and of course surgical identification of artifacts of interest, their parsing, normalization and cross-referencing, and finally supertimelining it all in no time. In many cases… the available, well-tested, generic forensic data processing and automation pipelines are doing so well, doing such phenomenal work – and yes, we have progressed so much, and on so many fronts, over the last decade – that the actual act of old school forensic process is getting kinda lost today… BUT the good news is that we close cases faster than ever…

One may ask: is there still any reason out there to think of better forensic methods to analyze data today?

YES!

While many ‘cyber’ cases are truly benefiting from these fast, targeted, surgical, and automated artifact analysis and processing pipelines… we still have many cases where the forensic exam must be done in the ‘beyond a reasonable doubt’ way. In my first forensic job I was primarily working non-LE cases but sometimes supported the team that was analyzing CSAM and other criminal content. I must say that their attention to detail shaped my professional life in many positive ways. But at that time I had a lot of questions: f.ex. I was very perplexed that they insisted on a very mundane, one-by-one, manual analysis of all media files on the devices they were examining. In my eyes, they were wasting a lot of time (many of the pictures were just small icons, many shipped by default as part of the OS) and should have been optimizing this process… Only later did the reason for their approach become apparent to me: any miss on their side would mean a compromise of their forensic expert profile and integrity that the other side (in court) could exploit.

Hmm but I still think browsing all these tiny, built-in media files was a waste of time…

In another job, I worked closely with a DFIR team that did a lot of e-discovery work. Again, I was perplexed that they were spending hours browsing through people’s mailboxes looking for any sign of ‘bad’. Very, very boring and poor ROI. BUT IT HAD TO BE DONE RIGHT.

It’s safe to say that every forensic exam has its own objectives. And because of this, while many of these objectives appear to be achievable only via a brute-force approach, I do believe there are still many research avenues out there that – if successful – could benefit these forensic processes and, in the end, make these objectives faster to achieve – both in general and in more specific, targeted cases.

A good example is my (sorry for the ego trip) filighting idea that can be used to reduce the amount of evidence we need to process… and it’s thanks to a simple observation: legitimate software consists of files and resources that are referenced from inside the software itself.

We can also entertain more ‘horizontal’ data reduction techniques, like the idea of targeting function-specific files, f.ex. license files. But there is more…

If we look at a random instance of a Windows system, we can observe a number of patterns:

  • Windows OS files — often protected from deletion and signed (nowadays often via catalogue signing); include lots of subfolders — many of which have been exploited by threat actors
  • Program Files — existing in many localized and architecture-specific versions; these mostly contain files belonging to legitimate software packages (and sometimes supply-chain attack files, too)
  • Common Files — as above, for many software packages
  • TMP/TEMP folders — both c:\windows\temp and user-specific temporary folders
  • USERPROFILE folder — a lot is going on here, definitely subject to further analysis
  • ProgramData folder — as above, lots is going on here
  • Dedicated Dev folders — where devs develop and test their code
  • Portable Apps — usually stored in dedicated, separated folders
  • Local or Legacy apps — often saved in c:\<appfolder>\ directories
  • Shared folders — often just a misconfiguration problem
  • etc.

When you look at the file system-based evidence from the perspective of file clusters it may become apparent that many of these ‘default’ directories start their existence with a very predetermined, baseline-like list of files inside them.

For instance:

  • c:\windows\notepad.exe
  • c:\windows\System32\kernel32.dll
  • c:\windows\System32\KernelBase.dll
  • c:\windows\SysWow64\kernel32.dll
  • c:\windows\SysWow64\KernelBase.dll
  • and hundreds, if not thousands of other file names like this

These are HARD to modify today.

Imagine getting a Windows OS-based image as evidence. You first exclude all files whose hash matches your ‘clean hash’ list, then you exclude all files that are 100% OS files (based on their paths AND/OR signatures, including catalogue signing), then you exclude all that have a corresponding symbol file on Microsoft symbol servers (but be mindful of the SigFlip attack), then you exclude clusters of files that are installed as part of a dedicated program installation event, then you exclude filighted files, then you exclude all media files that are 100×100 pixels or smaller, then you go a bit higher level and exclude less-recognizable clusters of software*, then you exclude… yes, you name it. With a lot of ideas like these we can beat down the ‘attack surface’ for manual analysis to a substantially smaller subset of evidence!
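To make the idea a bit more concrete, here is a minimal sketch of what such a layered exclusion pass could look like. Everything in it is an assumption made for illustration: the dictionary keys (path, sha256, is_catalogue_signed, image_dims) stand in for whatever a file listing parser would emit, and the clean hash set, OS path/signature list, and per-software cluster lists would all have to be built separately.

    # Hypothetical layered data-reduction pass over a file listing extracted
    # from an image. known_hashes, os_signed_paths and cluster_lists are
    # assumed to be prepared separately; all names are illustrative only.
    from pathlib import PureWindowsPath

    def reduce_listing(files, known_hashes, os_signed_paths, cluster_lists,
                       max_icon_dim=(100, 100)):
        """Return the subset of `files` left for manual review.

        Each entry in `files` is a dict with keys: path, sha256,
        is_catalogue_signed, image_dims (None for non-media files).
        """
        survivors = []
        for f in files:
            path = str(PureWindowsPath(f["path"])).lower()
            # 1. known-good hashes
            if f["sha256"] in known_hashes:
                continue
            # 2. OS files matched by path and catalogue signature
            if path in os_signed_paths and f["is_catalogue_signed"]:
                continue
            # 3. files belonging to a known software installation cluster
            if any(path in cluster for cluster in cluster_lists):
                continue
            # 4. tiny media files (100x100 px or smaller)
            dims = f.get("image_dims")
            if dims and dims[0] <= max_icon_dim[0] and dims[1] <= max_icon_dim[1]:
                continue
            survivors.append(f)
        return survivors

The symbol-server and filighting checks mentioned above would slot in as additional early exits; each layer only ever shrinks the set that reaches the examiner’s eyes.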

*And it’s time to prove that piracy is good (sometimes).

Many paid software packages are heavily guarded and can’t be accessed/downloaded without paying a license fee. Now, pirated copies don’t have these restrictions and one could download them temporarily, look at their folder structures and file lists (even extract hashes of each file), and – if the package includes a parseable installer or archive – extract and analyze its content, preserve the file listing for the future, and keep adding them to a ‘known file’ database (both by hash and by absolute/relative path and/or file name/file size). With a large number of such clustered file lists one could potentially remove a lot of noise from the examined evidence.
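A rough sketch of what harvesting such a ‘known file’ record set from an extracted package could look like; the recorded fields (product, relative path, file name, size, hash) follow the paragraph above, while everything else (the SQLite table, the function and parameter names) is purely my own assumption.

    # Sketch: walk an extracted software package and record product, relative
    # path, file name, size and SHA-256 into a simple SQLite 'known files'
    # table. Table and function names are illustrative assumptions only.
    import hashlib
    import os
    import sqlite3

    def harvest_package(package_root, product_name, db_path="known_files.db"):
        con = sqlite3.connect(db_path)
        con.execute("""CREATE TABLE IF NOT EXISTS known_files (
                           product   TEXT,
                           rel_path  TEXT,
                           file_name TEXT,
                           size      INTEGER,
                           sha256    TEXT)""")
        for dirpath, _dirs, names in os.walk(package_root):
            for name in names:
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, package_root)
                with open(full, "rb") as fh:
                    digest = hashlib.sha256(fh.read()).hexdigest()
                con.execute("INSERT INTO known_files VALUES (?, ?, ?, ?, ?)",
                            (product_name, rel, name,
                             os.path.getsize(full), digest))
        con.commit()
        con.close()

    # Example: harvest_package(r"C:\staging\SoftwareX_extracted", "SoftwareX 1.2")

Run against enough packages, the same table doubles as the per-software ‘cluster’ lists used for the exclusions described earlier.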

If you are worried about piracy – it was just clickbait. You can find many of the very same files available for download on many popular ‘upload your file and we will tell you if it is bad’ legitimate sites…

Now, this is not to imply that we should be doing data reduction at all costs. Mistakes happen and data sets may be inaccurate. Supply chain attacks are real, and an ‘allowlisted’ path coming from some old installer of Software X does not imply that the very same path in its current instantiation is clean. But… the idea is about cutting corners and making it easier to spot a smoking gun, not about permanently removing the evidence from view…

A license (metadata) to kill (for)…

Many forensic artifacts can be looked at from many different angles. A few years ago I proposed a concept of filighting that tried to solve the problem of finding unusual, orphaned, and potentially malicious files dropped inside directories whose other files DO NOT reference these orphaned files at all.
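For illustration only, here is a deliberately naive sketch of that idea; a real implementation would parse imports, resources, manifests, etc., but even a crude "is this file name mentioned anywhere in its sibling files?" check conveys the gist (all names below are my own assumptions).

    # Naive 'filighting'-style check: within a directory, flag files whose
    # names are never referenced (as ASCII or UTF-16LE strings) inside any
    # sibling file. Purely illustrative; not a production implementation.
    import os

    def orphaned_files(directory):
        names = [n for n in os.listdir(directory)
                 if os.path.isfile(os.path.join(directory, n))]
        referenced = set()
        for n in names:
            with open(os.path.join(directory, n), "rb") as fh:
                data = fh.read().lower()
            for sibling in names:
                if sibling == n:
                    continue
                needle = sibling.encode("ascii", "ignore").lower()
                # match both raw ASCII and UTF-16LE representations
                if needle and (needle in data or
                               sibling.encode("utf-16-le").lower() in data):
                    referenced.add(sibling)
        return [n for n in names if n not in referenced]

    # Example: print(orphaned_files(r"C:\Program Files\SoftwareX"))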

I really hope that forensic analysis tools will evolve to add more features that will help to automate file system analysis based not only on a list of known hashes and/or file extensions, but also paths, partial (relative) paths, file names, actual file types based on their content, and ideas that rely on more complex algorithms: using prebuilt artifacts collections, leveraging various correlations (ideas like filighting), and of course machine learning and AI.

Today I want to explore one more angle of looking at file system artifacts — classes of file content. There are many file formats out there: executables, documents, configuration files, database files, and many other file types. The classification I am focusing on today though is slightly different – the format itself doesn’t interest me too much, but the function of the file does…

My guinea pig will be the license file – the type of file that is all over the place, but that no one reads. And yes, removing them from the examiner’s view (during file system analysis) may not add a lot of value, but it’s used here only to illustrate the idea. There are many other file classes like this that can be classified as noise to the examiners’ eyes and if we start clustering them together, who knows, maybe we have just saved some person-hours there…

I asked myself the following question:

– having a file system in front of me, how do I find all license files on it?

There are at least a few approaches I can think of:

  • use hashes of known license files,
  • use file names typically used by license files,
  • analyze content of all files and look for content that resembles a license file.

All of them have their own challenges:

  • the first one needs a lot of prep work to collect good hashes,
  • the second one is hard to do w/o some proper analysis of a clean sample set, and
  • the third one is the most reliable, but it’s slow & needs even more preparation because it has to take into account a few more aspects: localization issues (licenses in various languages), file encoding issues (Unicode variants, ASCII, MBCS), file formats (TXT, RTF, HTM(L), PDF, DOC(X), etc.), and of course — performance (reading many files to analyze their content is expensive, plus not every file referencing GPL, LGPL, GNU is a license file)

I am going to focus here on the second one.

Your typical license file is usually called license, license.txt, or eula.txt, and in the case of Open Source software, we often see files named gpl.txt, license.gpl.txt, lgpl.txt, etc.

When you start researching this file naming bit a bit more, you will soon realize that there are a lot of variations. A lot of the issues listed in the third point come into play as well, f.ex.:

  • file names can be localized,
  • file extensions can be .txt, .rtf, .htm(l), .doc(x), .pdf, .xml,
  • some of the file names have typos,
  • many license file names use various prefixes or suffixes that identify the licensed software, or the language/code page the license file is written in,
  • some file names may refer to compressed file names f.ex. *.tx_ (in installation packages),
  • some license files may be stored inside the archives (including password-protected files) or installers,
  • some licenses are embedded inside the compiled help files (.hlp, .chm),
  • some programs may be hiding the licensing information in files named with various infixes: copying, releasenotes, thirdparty, copyright, and their variants, etc.,
  • some may refer to the software version or edition, f.ex. full, trial, evaluation,
  • some files with ‘license’ in their name actually refer to software licensing mechanics (getting keys, subscriptions, transferring licenses, etc.),
  • finally, some file names may be available in an 8.3 DOS notation only.

As usual, the more you look, the more complex the problem you see.
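To make the second approach a little more tangible, here is a minimal filename-heuristic sketch. The pattern covers only a handful of the variations listed above and is an illustrative assumption of mine, not the compiled list mentioned below.

    # Minimal filename-based heuristic for spotting likely license files.
    # The pattern is a tiny, deliberately incomplete subset; it will produce
    # both false negatives (localized names) and false positives (files about
    # licensing mechanics rather than license texts).
    import re

    LICENSE_NAME_RE = re.compile(
        r"""^(eula|licen[sc]es?|licencia|lizenz|copying|copyright|l?gpl)
             [\w.\- ]*                          # product/language suffixes
             (\.(txt|rtf|html?|docx?|pdf|xml|tx_))?$""",
        re.IGNORECASE | re.VERBOSE,
    )

    def looks_like_license(file_name):
        return bool(LICENSE_NAME_RE.match(file_name))

    # Example:
    # for name in ("license.txt", "EULA.rtf", "lgpl-3.0.html", "setup.exe"):
    #     print(name, looks_like_license(name))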

For this post I have compiled a large file containing possible license file names. You can download it here.

Will it make anybody’s life easier?

I don’t know.

What matters is that we learned a little bit more about how difficult the process of automated file system analysis is. What started as a trivial and frivolous idea ended up being a Don Quixotish attempt to formalize something that is impossible to tackle, even with a data-heavy approach…