Minority (forensic) report, aka defending forward w/o hacking back

We love to put a wedge between detection and response. Many of us kinda agree that telemetry analysis is one thing and the actual forensic examination of the evidence is another.

However…

In this post I will try to turn the tables a bit.

Prefetch files

All the command-line tricks you use to execute your executable or script, load your library, etc. will end up with Prefetch entries created on the victim system.

Look at clusters of Prefetch files created within small time intervals and you will find interesting bundles of programs executed one after another – great pivot points.
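A minimal sketch of that clustering idea: given (name, created) pairs pulled from the Prefetch folder, group entries whose creation times fall within a short window of each other. The 60-second window is an arbitrary assumption; tune it for your environment.

```python
from datetime import timedelta

def cluster_prefetch(entries, window_seconds=60):
    """Group (name, created_datetime) Prefetch entries into clusters where
    each entry was created within `window_seconds` of the previous one."""
    entries = sorted(entries, key=lambda e: e[1])
    clusters, current = [], []
    for name, ts in entries:
        if current and (ts - current[-1][1]) > timedelta(seconds=window_seconds):
            clusters.append(current)
            current = []
        current.append((name, ts))
    if current:
        clusters.append(current)
    # clusters with more than one entry suggest a bundle of programs
    # executed one after another -- a pivot point
    return [c for c in clusters if len(c) > 1]
```

Feed it the creation timestamps of everything under C:\Windows\Prefetch and eyeball the multi-entry clusters.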

Most Recently Used

This Magnet Forensics article describes a number of forensic artifacts related to Most Recently Used files/commands and RecentDocs that we can track with telemetry.

By carefully examining these, we can detect a lot of user-driven activity that hints at what is happening on a system, while ignoring the telemetry noise generated by the OS and legitimate software.
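The ordering of RecentDocs entries (stored under HKCU\Software\Microsoft\Windows\CurrentVersion\Explorer\RecentDocs) is kept in the MRUListEx value: a sequence of little-endian DWORD indices terminated by 0xFFFFFFFF, most recent first. A small decoder sketch:

```python
import struct

def decode_mrulistex(raw: bytes):
    """Decode an MRUListEx value: little-endian DWORD indices terminated
    by 0xFFFFFFFF, ordered most-recently-used first."""
    order = []
    for (idx,) in struct.iter_unpack("<I", raw):
        if idx == 0xFFFFFFFF:
            break
        order.append(idx)
    return order
```

The returned indices map to the numbered values under the same key, giving you the order in which documents were opened.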

Jump List App ID analysis

Again, instead of concentrating on the input (obfuscated command lines leading to the execution of many lolbins and executables, often nested), we focus on analysis of files with the following extensions:

  • .automaticDestinations-ms
  • .customDestinations-ms

We can use exclusions like this list. We can also use techniques like Least Frequency Occurrence (LFO) to focus on outliers. We can pivot from them and see what happens on the investigated system prior to these artifacts being created.
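The LFO idea mentioned above can be sketched in a few lines: count how often each Jump List App ID appears across the fleet and surface the rare ones. The `max_count` threshold is an arbitrary assumption.

```python
from collections import Counter

def lfo_outliers(app_ids, max_count=2):
    """Least Frequency Occurrence: App IDs seen `max_count` times or fewer
    across the data set are outliers worth pivoting on."""
    counts = Counter(app_ids)
    return sorted(a for a, n in counts.items() if n <= max_count)
```

Run it over the App ID portion of every *.automaticDestinations-ms / *.customDestinations-ms file name collected fleet-wide, after applying your exclusion list.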

Application crashes

Telemetry can tell us about a lot of interesting events. One of them is an application crash.

Anytime we see an invocation of werfault.exe, or the creation of a .dmp file, we should look at the events that preceded it.
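A rough sketch of that pivot, assuming telemetry events are available as (timestamp, process, target) tuples (that shape, and the 30-second lookback, are assumptions for illustration):

```python
def events_before_crash(events, lookback=30):
    """events: time-sorted (timestamp, process, target) tuples.
    Return the events within `lookback` seconds before each werfault.exe
    invocation or .dmp file creation."""
    crash_times = [t for t, proc, target in events
                   if proc.lower() == "werfault.exe"
                   or target.lower().endswith(".dmp")]
    context = []
    for ct in crash_times:
        # everything that happened just before the crash indicator
        context.extend(e for e in events if ct - lookback <= e[0] < ct)
    return context
```

In a real EDR this would be a query joining process-start and file-creation events, but the windowing logic is the same.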

Persistence

I wrote a lot about this topic and it still holds true. Collect telemetry related to these generic load points and you will find some ‘bad’ in no time.

Quarantine files

My old tool DeXRAY is still being used in 2025. I am quite shocked, but also pleased. What it means in practice though is that:

  • companies still use antivirus software
  • some companies still use more than one endpoint security control per endpoint
  • detection of a quarantine file being created is a good pivot point for some additional digging
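As a sketch of that pivot, file-creation telemetry can be matched against known quarantine locations and extensions. The patterns below are a small illustrative subset, not an exhaustive list (DeXRAY itself supports far more formats):

```python
# Example quarantine locations/extensions (illustrative, not exhaustive)
QUARANTINE_HINTS = (
    "\\windows defender\\quarantine\\",  # Microsoft Defender
    ".vbn",                              # Symantec/Broadcom
    ".bup",                              # McAfee
)

def is_quarantine_artifact(path: str) -> bool:
    """Flag file-creation events that look like AV quarantine activity."""
    p = path.lower().replace("/", "\\")
    return any(p.endswith(h) or h in p for h in QUARANTINE_HINTS)
```

Any hit is a timestamped anchor: something was detected on that box at that moment, so look at what happened around it.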

Lnk files

Surprisingly, they are not created that often, so they are an interesting artifact to look at.

They may point to an insider threat, they may point to malware, or they may waste some triage time leading us nowhere (a list of exclusions should be easy to build though: ‘What's New.lnk’, ‘About <program>.lnk’, ‘Uninstall <program>.lnk’, ‘Magnify.lnk’, ‘Narrator.lnk’, etc.).
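The exclusion list above can be applied with simple wildcard matching (the `<program>` placeholders become `*`):

```python
import fnmatch

# Benign shortcut names from the examples above, as wildcard patterns
LNK_EXCLUSIONS = [
    "what's new.lnk",
    "about *.lnk",
    "uninstall *.lnk",
    "magnify.lnk",
    "narrator.lnk",
]

def interesting_lnks(names):
    """Drop well-known benign .lnk names, keep the rest for triage."""
    return [name for name in names
            if not any(fnmatch.fnmatch(name.lower(), pat)
                       for pat in LNK_EXCLUSIONS)]
```

Everything that survives the filter is worth a quick look at its target path and arguments.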

Background Activity Moderator (BAM)

This is an obvious candidate for monitoring as it references the user’s SID and can tell us who actually executed that particular program.

Monitoring the entries in this Registry branch can help us to detect anomalies when we start seeing execution of unusual processes.
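BAM entries live under HKLM\SYSTEM\CurrentControlSet\Services\bam\State\UserSettings\<SID>, with one value per executed program; the first 8 bytes of each value's data hold the last-execution FILETIME. A parsing sketch (the value layout beyond those 8 bytes is ignored here):

```python
import struct
from datetime import datetime, timedelta, timezone

def parse_bam_entry(key_path: str, value_name: str, value_data: bytes):
    """Extract (sid, program, last_run) from a BAM UserSettings entry.
    The first 8 bytes of the value data are a Windows FILETIME."""
    sid = key_path.rstrip("\\").rsplit("\\", 1)[-1]
    (filetime,) = struct.unpack_from("<Q", value_data)
    # FILETIME counts 100-ns intervals since 1601-01-01 UTC
    last_run = datetime(1601, 1, 1, tzinfo=timezone.utc) + timedelta(
        microseconds=filetime // 10)
    return sid, value_name, last_run
```

The SID recovered from the key path is what ties the execution to a specific user.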

Unusual File Creation activity

In many posts in the past I have highlighted a lot of quite unusual DLL side-loading techniques.

Detecting file operations associated with these unusual side-loading activities is a good way to detect more advanced attackers (and yes, many of them are actively using some of my techniques!).
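One simple file-operation detection in this space: flag well-known system DLL names being written outside the Windows directory. The DLL list below is a small illustrative subset of commonly side-loaded names, not a curated detection rule:

```python
# A few DLL names commonly abused for side-loading (illustrative subset);
# seeing them created OUTSIDE the Windows directory is suspicious
SIDELOAD_DLLS = {"version.dll", "dwmapi.dll", "uxtheme.dll", "wininet.dll"}

def suspicious_dll_drops(file_events):
    """file_events: full paths from file-creation telemetry.
    Flag well-known system DLL names written to non-system locations."""
    hits = []
    for path in file_events:
        p = path.lower().replace("/", "\\")
        name = p.rsplit("\\", 1)[-1]
        if name in SIDELOAD_DLLS and not p.startswith("c:\\windows\\"):
            hits.append(path)
    return hits
```

A production rule would also check whether a signed executable landed in the same directory shortly before, but even this naive version surfaces interesting drops.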

There are probably more forensic artifacts that we can monitor for some early detections, but the set above should give you some pointers…

Malware Source code string extraction

Every once in a while we get our hands on a source code corpus of some malware (thx vx-underground!). Whether it is a quality release or not, we don’t care, because we know we usually get a kinda mixed bag of data and code – and as long as it leans towards ‘new’ and ‘quality’, we still benefit from getting access to some of this ‘bad’ code – typically written in C, C++, .NET, Go, Rust, or… AutoIt.

No matter the language of choice though, we always want to start such corpus analysis by cherry-picking the low-hanging fruit first…

One way to do it is by extracting all strings referenced by the leaked source code. Some of these extracted strings are so unique that they can form a set of perfect (unique) IOCs. It’s not surprising then that having a proper methodology in place to identify this sort of artifact quickly is very important – everyone loves quick, impactful wins.
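A minimal sketch of that extraction step: pull every single- or double-quoted literal out of a source file with a regex that honors backslash escapes, dropping short literals since they make poor IOCs (the 4-character minimum is an arbitrary assumption):

```python
import re

# Matches double- or single-quoted literals, honoring backslash escapes
QUOTED = re.compile(r'"((?:\\.|[^"\\])*)"|\'((?:\\.|[^\'\\])*)\'')

def extract_quoted_strings(source: str, min_len=4):
    """Pull every quoted string literal out of a source file; short and
    empty literals are dropped as they make poor IOCs."""
    out = []
    for m in QUOTED.finditer(source):
        s = m.group(1) if m.group(1) is not None else m.group(2)
        if len(s) >= min_len:
            out.append(s)
    return out
```

It is deliberately language-agnostic: the same regex does a good-enough job on C, AutoIt, PowerShell, Python, and friends for this kind of triage.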

But what is a source code file, you may ask?

Depending on the era you are from, your preferred OS, programming language, compiler, IDE… it can mean a lot of different things. Even in 2025 there are many people who still program in Visual Basic for Applications, Visual Basic Script, Perl, and even Cobol, Fortran, or the not-so-old Delphi, while many (more modern) programmers can’t live without Go, Rust, Nim, and Python. Then some other folks still make a living off .bat, .cmd and .vbs files, despite the fact that the Windows sysadmin world pretty much endorsed PowerShell’s power and moved on from the 90s like… 10-15 years ago. And then some OGs still maintain HTA scripts, some still write multi-platform code in C, some live in assembly all day long, and some more recent coders often don’t even know what they are doing (copying and pasting ChatGPT-generated code into their consoles, hoping it can do all the magic for them). And we should not forget the files that describe installers’ inner workings, the compilation process, the linking process, and others, where scripting/coding capabilities are still present but may not be immediately apparent.

What’s constant about all these use cases is that most of the files created by the conservative programmers of the 70s, 80s, 90s and 2000s, and the more ‘modern’ code generated by the children of the 2010s and 2020s, almost inevitably end up being saved with a predictable file extension. Malware authors are bound by the spell of file extensions, too. Even the most conservative macOS and Linux users cannot escape this predictable behavior, and thanks to that, we can still attempt to build an ultimate list of file extensions that refer to programming activities in one way or another – one that covers all the modern desktop OSes. While we do that, we intentionally exclude HTML and CSS files and their derivatives: they are very ‘spammy’ in nature, and the code present in these files is (most of the time) not ‘real’ code.

Why do we pay so much attention to file extensions, you may ask? A large corpus of source code has a very peculiar problem that we need to solve: there are too many files to look at.

Here’s a histogram of file extensions from the repo referenced above:

================
85088 files in total
================
png	26631
ico	18024
dll	6070
exe	3630
bmp	2926
au3	2429
js	2205
txt	1978
html	1845
smali	1611
skn	1608
7z	1548
gif	872
svg	786
ini	650
css	608
md	603
scss	599
wav	585
xaml	500
xml	475
jpg	467
class	463
dat	411
ocx	374
asz	370
jar	362
pdb	293
ps1	269
(no extension)	200
pl	178
bin	173
ttf	173
db	159
cs	153
php	150
inf	149
config	142
json	141
lng	130
zip	118
pak	114

We can easily discard media files, compiled binary objects, libraries, executables, and font files, but if we really want quick wins we need to stay very focused.
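A histogram like the one above takes only a few lines to produce; the `unpacked_corpus` path below is a placeholder for wherever you unpacked the repo:

```python
import os
from collections import Counter

def extension_histogram(root: str) -> Counter:
    """Count file extensions (lowercased, leading dot stripped) under `root`."""
    counts = Counter()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            counts[ext] += 1
    return counts

# usage sketch:
# for ext, n in extension_histogram("unpacked_corpus").most_common(40):
#     print(f"{ext}\t{n}")
```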

So…

We should ask: what are the file extensions that are related to programming activities in the year 2025?

It’s actually a very long list…:

  • accdb
  • ace
  • ahk
  • app
  • appxmanifest
  • asm
  • asp
  • aspx
  • au3
  • awk
  • backup
  • bak
  • bas
  • bat
  • bdsproj
  • btm
  • c
  • cbl
  • cc
  • cfg
  • cgi
  • cls
  • cmakelists.txt
  • cmd
  • cnf
  • cob
  • conf
  • config
  • cpp
  • cppm
  • cs
  • csproj
  • cu
  • cxx
  • dcu
  • def
  • dfm
  • dlg
  • doc
  • docm
  • docx
  • dpc
  • dpj
  • dpk
  • dpr
  • dproj
  • dtd
  • dxp
  • e
  • eng
  • f
  • flt
  • fmt
  • for
  • fp
  • frm
  • frx
  • gnumakefile
  • go
  • h
  • hdl
  • hh
  • hhp
  • hid
  • hpp
  • hrc
  • hta
  • hwp
  • hwpx
  • hxx
  • idl
  • inc
  • inf
  • info
  • ini
  • ins
  • iss
  • ixx
  • java
  • js
  • jse
  • json
  • jsp
  • jsproj
  • jqs
  • kdelnk
  • l
  • lfm
  • lgt
  • ll
  • lng
  • lnk
  • lnx
  • lpk
  • lpr
  • lst
  • m
  • mac
  • macos
  • makefile
  • manifest
  • map
  • md
  • mdb
  • mk
  • mod
  • myapp
  • nfo
  • odf
  • odg
  • odp
  • ods
  • odt
  • nsi
  • par
  • pas
  • pdf
  • php
  • php3
  • pl
  • pm
  • pmk
  • policy
  • pp
  • pps
  • ppt
  • pptx
  • pre
  • prj
  • properties
  • ps
  • ps1
  • py
  • r
  • rb
  • rc
  • rdb
  • rdme
  • reg
  • resources
  • s
  • sbl
  • scp
  • sdi
  • seg
  • settings
  • sh
  • sln
  • smali
  • smf
  • sms
  • source
  • sql
  • src
  • swift
  • tag
  • toml
  • unx
  • url
  • vb
  • vba
  • vbe
  • vbp
  • vbproj
  • vbs
  • vcxproj
  • wsh
  • xaml
  • xfm
  • xls
  • xlsm
  • xlsx
  • xml
  • y
  • yaml
  • yml
  • yxx
  • ~ddp
  • ~dfm

It’s a long list and it’s a decent list, even if it will never be ‘final’. It covers very old programming languages; it covers many file extensions used by decades-long iterations of popular programming languages; it covers both compiled and interpreted languages; it covers commercial & open-source programming projects. It covers Microsoft Office, Open/Libre Office and Hangul Office macros, MATLAB, configuration files, make files, data files, project files, resource files, header files, definition files, localization files, etc. Most of the data/code these files store is saved in a plain-text format, but of course some store it in a compressed, encoded or otherwise non-trivial-to-extract form.

Coming back to the topic of this post… analysing large data sets that include source code of many malware families made available via leaks or releases of curated collections may sometimes feel like a very mundane and counterproductive task, but the approach I want to propose here can give us tangible results very quickly.

For instance, extracting all the quoted strings from a large corpus of malicious source code files allows us to quickly identify many hardcoded file names used by the malware. These file names can then be used to quickly detect malware-related activity within EDR/XDR telemetry. Queries focused on these hardcoded artifacts will help to detect actual infections plus the unwanted activities of employees (possible insider threats) who download such malicious repos to their corporate devices thinking this is an acceptable way of ‘analysing’ malicious data (it’s usually not, as it is an Acceptable Use Policy (AUP) violation in most cases).

Now that we have all this administrative fluff out of the way, let’s do some quick data crunching.

After unpacking all the archives present inside the sample set referenced in the first paragraph of this post, we look at all quoted strings referenced by the source code found inside files with extensions belonging to the ‘programming file extensions’ set listed above. We then narrow our attention to the .txt file names that we extract from that string set, and manually eyeball them all to quickly build a list of interesting artifacts.
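The .txt narrowing step can be sketched as a second regex pass over the extracted strings (the character class for file-name candidates is an assumption; widen it if your corpus uses exotic names):

```python
import re

def txt_filenames(strings):
    """From a set of extracted quoted strings, keep the ones that look
    like hardcoded .txt file names (bare names or path components)."""
    hits = set()
    for s in strings:
        for m in re.finditer(r'[\w\-. ]+\.txt\b', s, re.IGNORECASE):
            hits.add(m.group(0).strip().lower())
    return sorted(hits)
```

The lowercased, deduplicated output is exactly the kind of list you can paste straight into an EDR/XDR file-name query.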

If we are properly prepared, it takes no more than 30 minutes to extract interesting forensic artifacts from such a large source code corpus. Another 30 to eyeball the results and… 4h to write this blog post.