
PDB Goodness

August 31, 2019 in Clustering, PDB Paths

In the recently published Definitive Dossier of Devilish Debug Details, Steve Miller goes on a very entertaining adventure, looking at PDB paths of known malware campaigns and authors. I love this article, because I have always felt that the PDB path is a great, often overlooked forensic artifact, and even though I did some research on it myself in the past, I have never seen a comprehensive study on the level that Steve delivered.

Inspired by it, I had a quick look at PDB paths of… primarily clean files. I am saying primarily, because while I am almost certain that most of them are clean, one can never be 100% sure… To support the claim, I can list a couple of paths I found in this (allegedly) clean corpus, suggesting that ‘clean’ probably means different things to different people:

  • D:\TEMP\fuckingasus\Debug\fuckingasus.pdb
  • D:\Work\pgtool\svn\pgtoolfuck\Release\RTNicPgW32.pdb
  • D:\Work\pgtool\svn\pgtoolfuck\x64\Release\RTNicPgW64.pdb
  • d:\tmp\1driver\fuck4\rtl818xb\platform\ndis6\usb\objfre_wlh_x86\i386\rtl8187.pdb
  • C:\TMP\shit\msikbd.2k\objfre\i386\msikbd2k.pdb
  • c:\WORK\XPSDriver\oishitts_view\oishitts_xpsdrv093_051208_build\XPSRenderer092\xpsdriver\AquaFilter\Release\Win32\AquaFilter.pdb
  • C:\Users\lol g\Desktop\PowerBiosServer_20561\PowerBiosServer_20080428\PowerBiosServer\obj\Release\PowerBiosServer.pdb

I still believe that most of these are clean, and… perhaps an honest mistake led to these paths being incorporated into the final executables ;), and who knows… maybe some of them even got signed 😉

Looking at all these paths we can draw some quick conclusions:

  • We could use them to generate a bunch of Yara signatures that catch known-good stuff; this helps with clustering (a quick sketch of such rule generation follows this list)
  • Of course, since the file is now public, bad guys could re-use the existing paths to bypass the aforementioned potential Yara sigs by making them trigger on bad stuff pretending to be good stuff
  • We see that Perforce, SVN, CVS, GIT are popular repos and perhaps their presence can indicate a proper software development practice at a company that generated the executables (could this alone be a good indicator for determining if the file is benign?)
  • Lots of different programming languages in use
  • Lots of personal build environments (1K user unique names under c:\users folder alone!)
  • Some coders compiled programs under an Administrator account (in fairness, my corpus covers files from 2000 to 2019, so plenty of them come from the old-school times when Admin was the default for everything)
  • There are traces of some beautiful build environments out there; seriously, these are symptoms of very mature development practices visible directly in some of these PDB paths (their clusters)
  • Surprisingly, many paths are outside of C: drive — could this be a generic indicator of ‘good’ too?
  • Also, some of the usernames are clearly test-related; I am curious whether these were overlooked in a final build, or whether some files were ‘leaked’? (test, Test, SKtester, nbtester, cvcctest, Pretest, tester, test5, TestUser, Test2, Test05, TestPC, Pinocchio_test)
  • We have users from all over the place: English/American, Chinese, Indian, Irish, French, Korean, Russian, Arabic, etc.
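
Since the first conclusion above mentions generating Yara signatures from these paths, here is a minimal sketch of how that could look. It simply wraps each known-good PDB path in its own rule; the input list and rule-naming scheme are made up for illustration, and since PDB paths actually live in the CodeView debug directory, a stricter rule would parse that structure instead of matching the raw string anywhere in the file.

```python
# Minimal sketch: turn a list of known-good PDB paths into simple Yara rules.
# The input list and rule-naming scheme are illustrative only.
import re

def pdb_path_to_yara(pdb_path: str, index: int) -> str:
    # Derive a Yara-safe rule name from the PDB file name
    name = re.sub(r'\W+', '_', pdb_path.rsplit('\\', 1)[-1])
    escaped = pdb_path.replace('\\', '\\\\')  # Yara strings need backslashes escaped
    return '\n'.join([
        f'rule goodware_pdb_{index}_{name}',
        '{',
        '    strings:',
        f'        $pdb = "{escaped}" ascii nocase',
        '    condition:',
        '        uint16(0) == 0x5A4D and $pdb',  # MZ header plus the PDB path
        '}',
    ])

known_good = [
    r'D:\TEMP\fuckingasus\Debug\fuckingasus.pdb',
    r'C:\TMP\shit\msikbd.2k\objfre\i386\msikbd2k.pdb',
]
for i, p in enumerate(known_good):
    print(pdb_path_to_yara(p, i), end='\n\n')
```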

You can download a zipped archive with PDB paths here.

Note: This file is watermarked; you cannot use it for commercial purposes.

The quirks of Batch Processing

August 4, 2019 in Batch Analysis, Clustering

Processing large corpora of samples is a very interesting engineering project. Once started, it never ends. There are always new files to process, there is always code to add. It’s a great way to learn about files, in general.

You typically start with a basic script that helps recognize the file type of processed samples. A known file type helps us easily organize files into clusters. Of course, as you look at more and more file types, you start recognizing patterns that help to add additional info. File types we once considered atomic now have subtypes, and even subcategories, or are tagged in many ways. As you progress, sooner or later you will find yourself writing a full-blown file parser and content-analysis tool.
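
That first ‘basic script’ stage can be as small as a magic-byte lookup; a minimal sketch below, with an intentionally tiny, illustrative signature table (the real one grows forever, which is exactly the point):

```python
# Minimal sketch: bucket samples into per-type directories using magic bytes.
# The signature table is intentionally tiny and illustrative.
import os
import shutil

MAGIC = [
    (b'MZ',               'pe'),    # DOS/PE executables
    (b'PK\x03\x04',       'zip'),   # ZIP and ZIP-based formats (docx, jar, crx, ...)
    (b'\x7fELF',          'elf'),
    (b'%PDF',             'pdf'),
    (b'\xd0\xcf\x11\xe0', 'ole2'),  # legacy Office documents, MSI, etc.
]

def file_type(path: str) -> str:
    with open(path, 'rb') as f:
        head = f.read(8)
    for magic, label in MAGIC:
        if head.startswith(magic):
            return label
    return 'unknown'

def cluster_by_type(src_dir: str, dst_dir: str) -> None:
    # The simplest possible clustering: copy every sample into a per-type folder
    for name in os.listdir(src_dir):
        src = os.path.join(src_dir, name)
        if not os.path.isfile(src):
            continue
        bucket = os.path.join(dst_dir, file_type(src))
        os.makedirs(bucket, exist_ok=True)
        shutil.copy2(src, bucket)
```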

You will encounter many file types that are no longer used, or for which parsing tools only exist on a specific platform, or file types for which writing a dedicated parser would take a few months. You will start cutting corners by adapting your parser to parse the output generated by other parsers. You will add standard hashes, imphash, fuzzy hashes, and you will then start applying them to different sections of files. Previously ignored file sections will be researched, codified. You will scan with AV, multithread, multi-OS, multi-host. You will collect, correlate, enrich, present. Each file will become a graph of properties and correlations.
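
Since the pefile module is mentioned at the end of this post anyway, here is a hedged sketch of the ‘apply hashes to different sections of files’ step, assuming PE input and with error handling trimmed for brevity:

```python
# Sketch: whole-file hash, imphash, and per-section hashes for a PE sample.
# Requires the pefile module (pip install pefile); error handling omitted.
import hashlib
import pefile

def describe_pe(path: str) -> dict:
    with open(path, 'rb') as f:
        data = f.read()
    pe = pefile.PE(data=data)
    info = {
        'sha256': hashlib.sha256(data).hexdigest(),
        'imphash': pe.get_imphash(),   # import table hash, handy for clustering
        'sections': [],
    }
    for section in pe.sections:
        info['sections'].append({
            'name': section.Name.rstrip(b'\x00').decode(errors='replace'),
            'sha256': hashlib.sha256(section.get_data()).hexdigest(),
            'entropy': round(section.get_entropy(), 2),
        })
    return info
```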

Your parser will become a full-blown Frankenstein’s Monster.

And while I described it in a very generic way, I am going to list a couple of gotchas that you will come across while coding / running this thing.

  • Beware of the system command
    • Any time you execute a separate program via one of the many variations of the system command, you may end up executing a program that resides in a sample’s directory;
      why?
      The Windows system command relies on cmd.exe, which is often invoked blindly w/o specifying a full path; if it happens that your script is operating from within a directory where there is a sample called cmd.exe, that sample will be executed! (see the subprocess sketch after this list)
  • Beware of the Ampersands
    • On Windows, if a file name you are processing includes ampersands (pretty common if you wget some files while doing some web crawling) and you pass it to the shell w/o escaping the ampersands, you may end up executing multiple processes you didn’t plan to. For example, a file named foo&calc.exe handed to cmd.exe unescaped is parsed as two commands: foo, then calc.exe (the sketch after this list avoids the shell entirely).
  • Timeouts
    • If you utilize external programs, they will very often hang. You come back to check the progress 12h later and realize the whole process stopped, waiting for that external parser to finish its work; except it just got stuck in some never-ending loop! (the sketch after this list adds a timeout for exactly this reason)
  • File Typing is actually very hard
    • File extension is meaningless, but may be your last resort
    • Your ZIP recognition algorithm will thank you a lot if you can pull any extra information from a zipped file
      • Is it a Java Archive?
      • Is it a Chrome Plugin?
      • Is it a backup of photos?
      • Is it a ZIP attachment (e.g. ZIPSFX)
      • Is it password protected? Do ‘default’ protection passwords work?
    • Polyglot files are a thing
    • Web files can be hard to identify as they include keywords from many other languages; plus obfuscation; parsing this thing statically is impossible w/o a headless browser tool
    • ANSI vs. Unicode vs UTF8 vs. many other character encodings
    • UUEncode vs. BASE64 vs. many other data encodings
    • Is it a plain text, binary?
    • Is it English, Russian, Greek, Chinese?
    • What does the entropy tell us? How does it change across file content? (a small sliding-window sketch follows this list)
    • Does the file refer to any files provided in the same archive? Is there any link?
    • Is it packed/protected? Can it be statically unprotected? Unpacked?
    • What does a file content tell us?
      • Strings
      • Unicode Strings
      • MBCS Strings
      • Compiler
      • Linker
      • Section table
      • Import table
      • Export Table
      • .NET Metadata
      • Is it signed?
      • Is it an installer?
      • Any appended data?
      • Can files / sections inside be extracted statically (decompressor exists) and/or dynamically (unattended or guided installation)
      • Can the installation script be reconstructed?
      • Can we extract embedded files? Bitmaps, Icons, Movies, Strings, etc.
      • Any luck with Yara sigs?
      • What about a disassembly? decompilation?
      • Any compiler-specific metadata? (e.g. in Delphi files)
    • Is it malicious?
    • Is it a quarantined file?
    • Is it corrupted?
      • Not only based on file structure inconsistencies; you may come across binary files that have a specific file type and format, but are saved… as Unicode; it breaks the format, but if you can recognize the type of corruption you may try to undo it and parse the recovered file
    • Any anomalies observed? (e.g. high number of sections; section size longer than a file size)
    • etc.
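
The first three gotchas above (the blind cmd.exe lookup, the unescaped ampersands, and the hanging external parsers) can all be handled with the same invocation pattern. A hedged sketch below; the parser path and its arguments are hypothetical placeholders, not a real tool.

```python
# Sketch: run an external parser on a sample while dodging the gotchas above.
# C:\tools\some_parser.exe and its --dump switch are hypothetical placeholders.
import subprocess

def run_external_parser(sample_path: str) -> str:
    try:
        result = subprocess.run(
            # Arguments are passed as a list with shell=False (the default):
            # no cmd.exe is involved, so a sample named cmd.exe in the current
            # directory is never picked up, and an '&' in the file name is just
            # a character, not a command separator.
            [r'C:\tools\some_parser.exe', '--dump', sample_path],
            capture_output=True,
            text=True,
            timeout=120,   # kill the parser if it gets stuck in a never-ending loop
        )
        return result.stdout
    except (subprocess.TimeoutExpired, OSError):
        return ''
```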

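And for the ‘what does the entropy tell us, and how does it change across file content’ bullet, a minimal sliding-window Shannon entropy sketch; the window and step sizes are arbitrary choices for illustration:

```python
# Sketch: Shannon entropy per fixed-size window, to see how it varies across a file.
# Window and step sizes are arbitrary choices.
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_profile(path, window=4096, step=4096):
    # Returns a list of (offset, entropy) pairs; a sudden jump towards
    # ~8.0 bits/byte often points at packed or encrypted regions.
    with open(path, 'rb') as f:
        data = f.read()
    return [
        (offset, round(shannon_entropy(data[offset:offset + window]), 2))
        for offset in range(0, len(data), step)
    ]
```
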
If you happen to be writing a tool like this, remember that it’s easier today than it was 10-15 years ago. We have tons of tools to rely on and lots of code available. Antivirus scans are easy wins, but then you have projects like the pefile module, 7z, Universal Extractor, Resource Hacker in a CLI mode, hachoir, headless browsers, disassemblers, decompilers, virtual machines, etc. Combining it all together is hard, but possible, given time 🙂 Despite providing a long wishlist / description above, I have not implemented all of these things yet. And I’ve been coding it for 15 years. I guess 15 more to come 🙂