Malware source code string extraction

Every once in a while we get our hands on a source code corpus of some malware (thx vx-underground!). Whether it is a quality release or not, we don’t care, because we know we usually get a kinda mixed bag of data and code – and as long as it leans towards ‘new’ and ‘quality’ we still benefit from getting access to some of this ‘bad’ code – typically written in C, C++, .NET, Go, Rust, or… AutoIt.

No matter what the language of choice though, we always want to start such code corpus analysis by cherry-picking the low-hanging fruit first…

One way to do it is by extracting all strings referenced by the leaked source code. This is because some of these extracted strings are so unique that they can form a set of perfect (unique) IOCs. It’s not surprising then that having a proper methodology in place to identify these sorts of artifacts quickly is very important – everyone loves quick, impactful wins.

But what is a source code file, you may ask?

Depending on the era you are from, your preferred OS, programming language, compiler, IDE… it can mean a lot of different things. Even in 2025 there are many people who still program in Visual Basic for Applications, Visual Basic Script, Perl, and even COBOL, Fortran, or not-so-old Delphi, while many (more modern programmers) can’t live without Go, Rust, Nim, and Python. And then some other folks still make a living off .bat, .cmd and .vbs files despite the fact that the Windows sysadmin world pretty much embraced PowerShell’s power and moved on from the 90s like… 10-15 years ago. And then some OGs still maintain HTA scripts, some still write multi-platform code in C, some live in assembly all day long, and some more recent coders often don’t even know what they are doing (copying and pasting ChatGPT-generated code into their consoles, hoping it can do all the magic for them). And we should not forget the files that describe installers’ inner workings, the compilation process, the linking process, and others, where the scripting/coding capabilities are still present, but may not be immediately apparent.

What’s constant about all the use cases listed above is that most of the code, whether written by the conservative programmers of the 70s, 80s, 90s, and 2000s or generated by the more ‘modern’ children of the 2010s and 2020s, almost inevitably ends up being saved to files with a predictable file extension. Malware authors are bound by the spell of file extensions, too. Even the most conservative macOS and Linux users cannot escape this predictable behavior, and thanks to that, we can still make an attempt to build an ultimate list of file extensions that refer to programming activities in one way or another, one that covers all the modern desktop OSes. And while we do that, we intentionally exclude HTML and CSS files and their derivatives. They are very ‘spammy’ in nature, and the code present in these files is (most of the time) not ‘real’ code.

Why do we pay so much attention to file extensions, you may ask? Large corpora of source code have a very peculiar problem that we need to solve: there are too many files to look at.

Here’s a histogram of file extensions from the repo referenced above:

================
85088
================
png	26631
ico	18024
dll	6070
exe	3630
bmp	2926
au3	2429
js	2205
txt	1978
html	1845
smali	1611
skn	1608
7z	1548
gif	872
svg	786
ini	650
css	608
md	603
scss	599
wav	585
xaml	500
xml	475
jpg	467
class	463
dat	411
ocx	374
asz	370
jar	362
pdb	293
ps1	269
(no extension)	200
pl	178
bin	173
ttf	173
db	159
cs	153
php	150
inf	149
config	142
json	141
lng	130
zip	118
pak	114
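
A histogram like the one above can be produced with a few lines of Python. This is only a minimal sketch, not the tooling actually used for this post: the function names are made up for illustration, and it assumes the archives have already been unpacked into a single directory tree. Files without an extension are counted under an empty key.

```python
from collections import Counter
from pathlib import Path


def extension_histogram(root):
    """Count files under `root` by lowercase extension ('' means no extension)."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Path.suffix keeps the leading dot; strip it and normalize the case.
            counts[path.suffix.lower().lstrip(".")] += 1
    return counts


def format_histogram(counts, top=None):
    """Render the counts as tab-separated lines, most common extension first."""
    return "\n".join(f"{ext}\t{n}" for ext, n in counts.most_common(top))
```

Calling `print(format_histogram(extension_histogram("unpacked_corpus")))` would print one `extension<TAB>count` line per extension, where `unpacked_corpus` is a hypothetical directory name.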

We can easily discard media files, compiled binary objects, libraries, executables, font files, but if we really want quick wins we need to stay very focused.

So…

We should ask: what are the file extensions that are related to programming activities in the year 2025?

It’s actually a very long list…:

  • accdb
  • ace
  • ahk
  • app
  • appxmanifest
  • asm
  • asp
  • aspx
  • au3
  • awk
  • backup
  • bak
  • bas
  • bat
  • bdsproj
  • btm
  • c
  • cbl
  • cc
  • cfg
  • cgi
  • cls
  • cmakelists.txt
  • cmd
  • cnf
  • cob
  • conf
  • config
  • cpp
  • cppm
  • cs
  • csproj
  • cu
  • cxx
  • dcu
  • def
  • dfm
  • dlg
  • doc
  • docm
  • docx
  • dpc
  • dpj
  • dpk
  • dpr
  • dproj
  • dtd
  • dxp
  • e
  • eng
  • f
  • flt
  • fmt
  • for
  • fp
  • frm
  • frx
  • gnumakefile
  • go
  • h
  • hdl
  • hh
  • hhp
  • hid
  • hpp
  • hrc
  • hta
  • hwp
  • hwpx
  • hxx
  • idl
  • inc
  • inf
  • info
  • ini
  • ins
  • iss
  • ixx
  • java
  • js
  • jse
  • json
  • jsp
  • jsproj
  • jqs
  • kdelnk
  • l
  • lfm
  • lgt
  • ll
  • lng
  • lnk
  • lnx
  • lpk
  • lpr
  • lst
  • m
  • mac
  • macos
  • makefile
  • manifest
  • map
  • md
  • mdb
  • mk
  • mod
  • myapp
  • nfo
  • odf
  • odg
  • odp
  • ods
  • odt
  • nsi
  • par
  • pas
  • pdf
  • php
  • php3
  • pl
  • pm
  • pmk
  • policy
  • pp
  • pps
  • ppt
  • pptx
  • pre
  • prj
  • properties
  • ps
  • ps1
  • py
  • r
  • rb
  • rc
  • rdb
  • rdme
  • reg
  • resources
  • s
  • sbl
  • scp
  • sdi
  • seg
  • settings
  • sh
  • sln
  • smali
  • smf
  • sms
  • source
  • sql
  • src
  • swift
  • tag
  • toml
  • unx
  • url
  • vb
  • vba
  • vbe
  • vbp
  • vbproj
  • vbs
  • vcxproj
  • wsh
  • xaml
  • xfm
  • xls
  • xlsm
  • xlsx
  • xml
  • y
  • yaml
  • yml
  • yxx
  • ~ddp
  • ~dfm

It’s a long list and it’s a decent list, even if it will never be ‘final’. It covers very old programming languages, it covers many file extensions used by decades-long iterations of popular programming languages, it covers both compiled and interpreted languages, it covers commercial & open-source programming projects. It covers Microsoft Office, Open/Libre Office, Hangul Office macros, MATLAB, it covers configuration files, make files, it covers data files, it covers project files, resource files, header files, definition files, localization files, etc. etc. Most of the data/code these files store is saved in a plain text format, but then of course, some store it in a compressed, encoded or otherwise non-trivial to extract form.
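
Applying the list as a filter is simple, with one wrinkle: a few entries (makefile, gnumakefile, cmakelists.txt) are really whole file names rather than extensions. A minimal sketch follows, assuming a small excerpt of the list (the full set would be loaded the same way) and a hypothetical helper name:

```python
from pathlib import Path

# Excerpt of the extension list above; the real set would contain every entry.
PROGRAMMING_EXTENSIONS = {
    "au3", "bat", "c", "cpp", "cs", "go", "h", "ps1", "py", "vbs", "~dfm",
}
# Entries from the list that are whole file names, not extensions.
PROGRAMMING_FILE_NAMES = {"makefile", "gnumakefile", "cmakelists.txt"}


def is_programming_file(path):
    """Return True if the file name or its extension is on the list."""
    p = Path(path)
    if p.name.lower() in PROGRAMMING_FILE_NAMES:
        return True
    return p.suffix.lower().lstrip(".") in PROGRAMMING_EXTENSIONS
```

The comparison is case-insensitive, so `Loader.AU3` and `CMakeLists.txt` both qualify, while a plain `notes.txt` does not.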

Coming back to the topic of this post… analysing large data sets that include source code of many malware families that are made available via leaks or releases of curated collections may sometimes feel like a very mundane and counterproductive task, but the approach I want to propose here can give us tangible results very quickly.

For instance, extracting all the quoted strings from a large corpus of malicious source code files allows us to quickly identify many hardcoded file names used by the malware. These file names can then be used to quickly detect malware-related activity within EDR/XDR telemetry. Queries focused on these hardcoded artifacts will help to detect the actual infections + the unwanted activities of employees (possible insider threat) who are downloading such malicious repos to their corporate devices thinking this is an acceptable way of ‘analysing’ malicious data (it’s usually not, as it is an Acceptable Use Policy (AUP) violation in most of these cases).

Now that we have all this administrative fluff out of the way, let’s do some quick data crunching.

After unpacking all the archives present inside the sampleset referenced by the first paragraph of this post, we look at all quoted strings referenced by the source code found inside the files with extensions belonging to the ‘programming file extensions’ set listed above. We then narrow down our attention to look for .txt file names that we extract from that string set, and then we manually eyeball them all to quickly build a list of interesting artifacts.
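
The two steps described above (pull out quoted strings, then keep those that end with a .txt file name) can be sketched roughly as below. This is an illustration under stated assumptions rather than the exact tooling used for this post: the regexes and function names are made up, strings are assumed to sit on a single line, and a stray apostrophe in a comment can confuse the single-quote matching.

```python
import re

# A single- or double-quoted literal on one line. Backslash escape pairs
# such as \" or \\ are consumed as a unit so they do not end the string.
QUOTED = re.compile(r'''(["'])((?:\\.|(?!\1)[^\\\n])*)\1''')

# The trailing path component of a string, if it ends with .txt.
TXT_NAME = re.compile(r'[^\\/:*?"<>|\s]+\.txt$', re.IGNORECASE)


def quoted_strings(source_text):
    """Yield the raw (still escaped) content of every non-empty quoted string."""
    for match in QUOTED.finditer(source_text):
        if match.group(2):
            yield match.group(2)


def txt_file_names(strings):
    """Return the set of lowercase .txt file names found at the end of strings."""
    names = set()
    for s in strings:
        match = TXT_NAME.search(s)
        if match:
            names.add(match.group(0).lower())
    return names
```

Feeding every file selected by the extension list through `quoted_strings()` and then through `txt_file_names()` leaves a short candidate list that is still reviewed by eye.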

If we are properly prepared, it takes no more than 30 minutes to quickly extract interesting forensic artifacts from such a large source code corpus. Another 30 to eyeball the results and… 4h to write this blog post.

Hunting for the warez & other dodgy stuff people install / download, part 2

In the first part of this series we explored some basic search terms that can be used to find ‘unwanted’ software being installed on company endpoints. Today, I’d like to take this research a step further and look at other ‘questionable content’.

People download pirated video content from many questionable places. Finding these downloads is not difficult because lots of this activity will reference multimedia files with the following extensions:

  • ‘3g2’, ‘3gp’, ‘amv’, ‘asf’, ‘avi’, ‘bdjo’, ‘bdmv’, ‘clpi’, ‘divx’, ‘drc’, ‘f4a’, ‘f4b’, ‘f4p’, ‘f4v’, ‘flv’, ‘gif’, ‘gifv’, ‘M2TS’, ‘m2v’, ‘m4p’, ‘m4v’, ‘mkv’, ‘mng’, ‘mov’, ‘mp2’, ‘mp4’, ‘mpe’, ‘mpeg’, ‘mpg’, ‘mpls’, ‘mpv’, ‘MTS’, ‘mxf’, ‘nsv’, ‘ogg’, ‘ogv’, ‘qt’, ‘rm’, ‘rmvb’, ‘roq’, ‘svi’, ‘TS’, ‘viv’, ‘vob’, ‘webm’, ‘wmv’, ‘yuv’

As you can guess, searching for file creation events referencing these media file extensions is a good way to discover users that download multimedia content that may need to be reviewed.

And as usual, if we dig deeper, we can create complementary control detection logic that focuses on a different file extension set – one that is VERY attached to pirated video media content:

  • ass – Advanced Sub Station Alpha
  • dfxp – Flash XML (Distribution Format Exchange Profile)
  • inqscr – InqScribe transcript
  • itt – iTunes Timed Text
  • jss – JACOsub
  • sami – Synchronized Accessible Media Interchange
  • sbv – YouTube format
  • scc – Scenarist Closed Captions
  • smi – Synchronized Accessible Media Interchange
  • srt – SubRip
  • ssa – Sub Station Alpha
  • stl – Spruce Subtitle File
  • sup – Blu-ray PGS
  • sup – SonicDVD Creator
  • ttml – Timed Text Markup Language
  • usf – Universal Subtitle Format
  • vtt – Web Video Text Tracks (WebVTT)

If you don’t know what these are, where have you been for the last 3 decades?? 🙂

These are subtitle files that often accompany the pirated media files. So, it goes without saying that the presence of these files can be seen as low-hanging fruit that can lead us to discovering other undesirable goodies in the folders that host them.
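
One way to combine the two extension sets is to flag folders where a video file and a subtitle file were created side by side. The sketch below assumes the telemetry has already been reduced to a list of Windows file paths from file creation events; the function name is hypothetical and both extension sets are only excerpts of the lists above.

```python
from collections import defaultdict
from pathlib import PureWindowsPath

VIDEO_EXTENSIONS = {"avi", "mkv", "mp4", "mov", "wmv", "webm"}   # excerpt
SUBTITLE_EXTENSIONS = {"ass", "smi", "srt", "ssa", "sup", "vtt"}  # excerpt


def folders_with_video_and_subtitles(created_paths):
    """Return folders that received both a video file and a subtitle file."""
    kinds_by_folder = defaultdict(set)
    for raw in created_paths:
        # PureWindowsPath parses Windows paths on any host OS.
        path = PureWindowsPath(raw)
        ext = path.suffix.lower().lstrip(".")
        if ext in VIDEO_EXTENSIONS:
            kinds_by_folder[path.parent].add("video")
        elif ext in SUBTITLE_EXTENSIONS:
            kinds_by_folder[path.parent].add("subtitle")
    return sorted(
        str(folder)
        for folder, kinds in kinds_by_folder.items()
        if kinds == {"video", "subtitle"}
    )
```

A folder holding only a video file is not reported here, which keeps the output focused on the stronger signal; the video-only query remains useful on its own.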

Another type of warez file we should look at is the archive.

I mentioned them a few times in the past, but let’s be more systematic this time and focus on the telemetry referencing the container files created by the most popular archiving software very often used by the ‘scene’ that ‘releases’ warez to the public:

  • .rar, .7z, .zip, .cab, and
  • .arj, .lha, .kgb, .xz, and
  • multi-volume archives like
    • .7z.000, .7z.001, …,
    • .rar.000, .rar.001, …,
    • part1.rar, part2.rar, …
    • .r.01, .r.02, …,
    • .z.01, .z.02, …,
    • .z01, .z02, …,
    • .zx01, .zx02, …,
    • .zip.001, .zip.002, …,
    • .cab, .part2.cab, …,
    • and older, or less common file archives: https://en.wikipedia.org/wiki/List_of_archive_formats
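
Most of the naming schemes above collapse into two regular expressions. This is a rough sketch with a hypothetical function name; it covers only the patterns listed here, and the volume pattern also accepts the dotless .r01 form, which is not in the list but is a common RAR volume suffix.

```python
import re

# Single-file archives from the list above.
PLAIN_ARCHIVE = re.compile(r"\.(?:rar|7z|zip|cab|arj|lha|kgb|xz)$", re.IGNORECASE)

# Multi-volume naming schemes from the list above.
VOLUME_ARCHIVE = re.compile(
    r"(?:"
    r"\.(?:7z|rar|zip)\.\d{3}"    # .7z.001, .rar.000, .zip.002
    r"|part\d+\.(?:rar|cab)"      # part1.rar, .part2.cab
    r"|\.[rz]\.?\d{2}"            # .r.01, .z.01, .z01 (and .r01)
    r"|\.zx\d{2}"                 # .zx01
    r")$",
    re.IGNORECASE,
)


def classify_archive(file_name):
    """Return 'multi-volume', 'archive', or None for a file name."""
    if VOLUME_ARCHIVE.search(file_name):
        return "multi-volume"
    if PLAIN_ARCHIVE.search(file_name):
        return "archive"
    return None
```

Checking the volume pattern first matters, because names like `game.part2.rar` would otherwise be reported as plain .rar archives.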

Hunting for file creation events that refer to files with these file extensions may lead to some very interesting discoveries.

And yes, as usual, there is more:

  • Any file creation event referencing the .torrent file extension is of interest
  • Any command line invocation referencing a “magnet:” link is of interest
  • Any DNS requests related to known torrent/magnet sites are of interest
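
The three checks above can be expressed as a single function over a telemetry event. Everything about the event shape here is an assumption: the dictionary keys (`file_path`, `command_line`, `dns_query`) are hypothetical and need to be mapped to whatever the EDR/XDR actually exposes, and the domain set is a placeholder to be replaced with a maintained list.

```python
# Placeholder only; replace with a maintained list of torrent/magnet sites.
KNOWN_TORRENT_DOMAINS = {"torrent-site.example"}


def torrent_indicators(event):
    """Return the names of the torrent-related checks that an event triggers."""
    hits = []
    file_path = (event.get("file_path") or "").lower()
    command_line = (event.get("command_line") or "").lower()
    dns_query = (event.get("dns_query") or "").lower().rstrip(".")

    if file_path.endswith(".torrent"):
        hits.append("torrent_file")
    if "magnet:" in command_line:
        hits.append("magnet_link")
    # Match the domain itself or any of its subdomains.
    if any(dns_query == d or dns_query.endswith("." + d)
           for d in KNOWN_TORRENT_DOMAINS):
        hits.append("torrent_dns")
    return hits
```

An empty result means none of the three checks fired for that event.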

As we explore this particular topic we may get tempted to leverage this approach to hunt for more specific content like pr0n & CSAM, but I do not want to cover these here, because handling these properly requires a completely different approach – one that is better left to experienced DFIR teams working together with Legal and HR departments. And that’s because in cases of True Positives employees lose jobs, and/or go to prison.

Now… as we come to the end of this quick & dirty hunting guide, I need to be fair and mention a little caveat. While hunting for Acceptable Use Policy violations is pretty easy, the actual remediation is extremely difficult. Some of these findings (often in bulk) end up as items added to the company’s Risk Register. And anything that is listed there ends up being prioritized – AUP violations are always marked LOW on that priority list. Moreover, exploring AUP violations in your environment will inevitably lead you to discover violations committed by the security personnel, including CISOs. There is no clear way to solve it long-term without some serious commitment from the company’s security committee…