The quirks of Batch Processing

Processing large corpora of samples is a very interesting engineering project. Once started, it never ends. There are always new files to process, there is always code to add. It’s a great way to learn about files, in general.

You typically start with a basic script that helps to recognize a file type of processed samples. A known file type helps us to easily organize files into clusters. Of course, as you look at more and more file types, you start recognizing patterns that help to add additional info. File types we once considered atomic have subtypes now, and even subcategories, or are tagged in many ways. As you progress, sooner or later you will find yourself writing a full-blown file parser and content analysis tool.

You will encounter many file types that are no longer used, or for which parsing tools only exist on a specific platform, or file types for which writing a dedicated parser would take a few months. You will start cutting corners by adapting your parser to parse the output generated by other parsers. You will add standard hashes, imphash, fuzzy hashes, and you will then start applying them to different sections of files. Previously ignored file sections will be researched, codified. You will scan with AV, multithread, multiOS, multihost. You will collect, correlate, enrich, present. Each file will become a graph of properties and correlations.

Your parser will become a full-blown Frankenstein’s Monster.

And while I described it in a very generic way, I am going to list a couple of gotchas that you will come across coding / running this thing.

  • Beware of a system command
    • anytime you execute a separate program via any of many variations of system command you may end up executing a program that resides in a sample’s directory;
      why?
      Windows system command relies on a cmd.exe program that is often executed blindly w/o specifying a full path; if it happens that your script is operating from within a directory where there is a sample called cmd.exe, that sample will be executed!
  • Beware of the Ampersands
    • On Windows, if a file name you are processing includes ampersands (pretty common if you wget files while web crawling) and you pass it to the shell w/o escaping them, you may end up executing multiple processes you didn’t plan to. For example, a file with a (hypothetical) name like report&calc.exe, handed unquoted to cmd.exe, is read as two commands, and calc.exe runs as the second one.
  • Timeouts
    • If you utilize external programs they will very often hang. You come back and check the progress 12h later and realize the whole process stopped waiting for that external parser to finish its work; except it just got stuck in some never-ending loop!
  • File Typing is actually very hard
    • File extension is meaningless, but may be your last resort
    • Your ZIP recognition algorithm will thank you a lot if you can pull any extra information from a zipped file
      • Is it a Java Archive?
      • Is it a Chrome Plugin?
      • Is it a backup of photos?
      • Is it a ZIP attachment (e.g. ZIPSFX)?
      • Is it password protected? Do ‘default’ protection passwords work?
    • Polyglot files are a thing
    • Web files can be hard to identify as they include keywords from many other languages; plus obfuscation; parsing this thing statically is impossible w/o a headless browser tool
    • ANSI vs. Unicode vs. UTF8 vs. many other character encodings
    • UUEncode vs. BASE64 vs. many other data encodings
    • Is it plain text or binary?
    • Is it English, Russian, Greek, Chinese?
    • What does the entropy tell us? How does it change across file content?
    • Does the file refer to any files provided in the same archive? Is there any link?
    • Is it packed/protected? Can it be statically unprotected? Unpacked?
    • What does a file content tell us?
      • Strings
      • Unicode Strings
      • MBCS Strings
      • Compiler
      • Linker
      • Section table
      • Import table
      • Export Table
      • .NET Metadata
      • Is it signed?
      • Is it an installer?
      • Any appended data?
      • Can files / sections inside be extracted statically (decompressor exists) and/or dynamically (unattended or guided installation)?
      • Can the installation script be reconstructed?
      • Can we extract embedded files? Bitmaps, Icons, Movies, Strings, etc.
      • Any luck with Yara sigs?
      • What about a disassembly? decompilation?
      • Any compiler-specific metadata? (e.g. in Delphi files)
    • Is it malicious?
    • Is it a quarantined file?
    • Is it corrupted?
      • Not only based on file structure inconsistencies; you may come across binary files that have a specific file type and format, but are saved… as Unicode; it breaks the format, but if you can recognize the type of corruption you may try to undo it and parse the recovered file
    • Any anomalies observed? (e.g. high number of sections; section size larger than the file size)
    • etc.
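The first three gotchas (system command, ampersands, timeouts) share one fix: never go through a shell, always name the tool by its full path, and always set a time limit. Below is a minimal sketch in Python, assuming the external parser can be launched directly as a process; run_tool and its parameters are names made up for this illustration, not part of any existing tool.

```python
import os
import subprocess


def run_tool(exe_path, args, timeout_s=60):
    """Run an external parser without a shell and with a hard time limit.

    Returns (exit_code, stdout); exit_code is None when the tool timed out.
    """
    # Insist on an absolute path so nothing is resolved relative to the
    # current directory, which may be full of samples.
    if not os.path.isabs(exe_path):
        raise ValueError("refusing to run a tool by relative name: %r" % exe_path)
    try:
        done = subprocess.run(
            [exe_path, *args],            # list form: '&' stays a literal character
            stdin=subprocess.DEVNULL,     # a tool waiting for input cannot block us
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,            # the child is killed when this expires
            shell=False,                  # no cmd.exe / sh in between
        )
    except subprocess.TimeoutExpired:
        return None, b""
    return done.returncode, done.stdout
```

A timed-out sample comes back with an exit code of None, so the batch can log it and move on to the next file instead of hanging for 12h.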
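Some of the ZIP questions from the list above can be answered from the archive's directory alone, w/o extracting anything. This is a rough sketch using only Python's standard zipfile module; describe_zip and the tag names are invented for this example, and the heuristics (a MANIFEST.MF for Java archives, a top-level manifest.json for browser extensions, image-only content for photo backups) are simplifying assumptions, not reliable identification.

```python
import io
import zipfile


def describe_zip(data):
    """Return a set of coarse tags for a ZIP held in memory as bytes."""
    tags = set()
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        infos = zf.infolist()
        names = {info.filename for info in infos}
        if "META-INF/MANIFEST.MF" in names:
            tags.add("java-archive")
        if "manifest.json" in names:
            tags.add("maybe-browser-extension")
        # Bit 0 of the general purpose flags marks an encrypted entry.
        if any(info.flag_bits & 0x1 for info in infos):
            tags.add("encrypted")
        image_exts = (".jpg", ".jpeg", ".png", ".heic")
        files = [n for n in names if not n.endswith("/")]
        if files and all(n.lower().endswith(image_exts) for n in files):
            tags.add("photos-only")
    # Bytes in front of the first 'PK' signature hint at an SFX stub
    # or some other kind of prepended data.
    if not data.startswith(b"PK"):
        tags.add("prepended-data")
    return tags
```

The returned tags are meant to feed the clustering step; anything tagged encrypted would then be a candidate for trying the ‘default’ passwords.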
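The entropy question from the list is one of the cheapest to answer. Here is a small sketch, assuming plain Shannon entropy measured in bits per byte over fixed-size, non-overlapping windows; the 4096-byte window is an arbitrary choice for illustration.

```python
import math
from collections import Counter


def shannon_entropy(chunk):
    """Entropy of a bytes object in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum((n / total) * math.log2(n / total) for n in Counter(chunk).values())


def entropy_profile(data, window=4096):
    """Entropy of each consecutive, non-overlapping window of the data."""
    return [shannon_entropy(data[i:i + window]) for i in range(0, len(data), window)]
```

A profile that jumps from low values to something close to 8 part-way through a file is a typical hint of compressed or encrypted content, e.g. a packed section or appended data.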

If you happen to be writing a tool like this, remember that it’s easier today than it was 10-15 years ago. We have tons of tools to rely on and lots of code available. Antivirus scans are easy wins, but then you have projects like the pefile module, 7z, Universal Extractor, Resource Hacker in CLI mode, hachoir, headless browsers, disassemblers, decompilers, virtual machines, etc. Combining it all together is hard, but possible, given time 🙂 Despite providing a long wishlist / description above, I have not implemented all of these things yet. And I’ve been coding it for 15 years. I guess 15 more to come 🙂

Anti- techniques refresh A.D. 2019

Old-school malware used to detect Reverse Engineering tools by looking for artifacts created by this type of software. The most common artifacts include Process Names, DLL Names, mutexes, files, Registry Entries, Window Classes / Titles. It’s actually really trivial to catch sysinternals tools, Wireshark, OllyDbg, IDA, etc. by using simple Windows API calls that find a window with a specific class/title…

Long and ever-growing lists of ‘interesting’ window classes/titles used by these tools have been circulating within the cracking / malware community for many years & are kinda a standard now. So standard that they sometimes include artifacts as old as those created on Windows 9x (e.g. Softice references that are obsolete today).

Anyways….

I’ve been recently thinking of all these well-known tricks and it suddenly hit me that we don’t really hear much about software targeting newer tools on our scene:

It is handy to review these tools from an attacker’s perspective: we may be able to collect additional data points that can be easily converted into yara sigs, etc. And of course, this is a new class of old-is-new-again tricks that may be out there and that we are just not focusing on finding yet, a.k.a. potentially missing them.

IDA’s QT windows seem to be hard to spot using standard window enumeration APIs: the drawing routines are all internal to QT and there are no native window primitives used by the class other than a generic window belonging to class Qt5QWindowIcon. Still, we can query its window text, and if it contains the .idb or _i64 strings (that refer to IDA database file extensions), chances are high that IDA is running.

GHIDRA is Java-based, so its window classes are the Java ones, e.g. SunAwtDialog or SunAwtFrame. The window titles will of course reveal references to the program name, e.g. Ghidra: <project name>.

PE-Sieve is a command line tool, so there are no windows created, but it can still be spotted by looking at a process list. Any process with pe-sieve in its name should be a red flag.

Detect It Easy (DiE) is written in QT and, same as IDA, doesn’t use a native window hierarchy (just one class, QWidget). Still, its window titles reveal the name of the program, e.g. Detect It Easy 1.01 or Die.

WinDbg with Time Travel Debugging (TTD) relies heavily on a bunch of DLLs that will be loaded into a target process:

  • ttdloader.dll
  • ttdplm.dll
  • ttdrecord.dll
  • ttdrecordcpu.dll
  • ttdwriter.dll

Detecting the presence of any of these should work as a neat anti-debug trick. (Note: I have not explored it enough yet, but it would seem that ttdrecordcpu.dll and ttdwriter.dll are always loaded into a debugged process; the others are helper libraries & may not be present in a debuggee’s address space; I need to run more tests.)
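From the defender’s side (say, a sandbox report or a quick triage script) the same list can be used to flag a process that is being recorded. This is a minimal sketch, assuming the module paths of the process have already been collected by some other means; ttd_modules_in is a name invented for this example, and the paths in the usage line are hypothetical.

```python
import ntpath

# DLL names taken from the list above.
TTD_MODULES = {
    "ttdloader.dll",
    "ttdplm.dll",
    "ttdrecord.dll",
    "ttdrecordcpu.dll",
    "ttdwriter.dll",
}


def ttd_modules_in(module_paths):
    """Return the sorted TTD module names found in a list of module paths."""
    hits = set()
    for path in module_paths:
        # ntpath understands Windows-style paths regardless of the host OS.
        name = ntpath.basename(path).lower()
        if name in TTD_MODULES:
            hits.add(name)
    return sorted(hits)


# Hypothetical module listing of a single process.
print(ttd_modules_in([r"C:\Windows\System32\ntdll.dll", r"C:\tools\TTDRecordCPU.dll"]))
```

Matching on the lower-cased base name keeps the check independent of where the debugger package happens to be installed.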

XDBG is a great debugger and is gaining more and more users as it breaks Olly’s hegemony in user mode code analysis; it offers a debugger for both 32- and 64-bit programs. And since it was written in QT as well, it kinda suffers from the same detection limitations as the other programs I described above. Still, the window class Qt5QWindowIcon and the window title x32dbg or x64dbg give it away. Same goes for process names.

Fakenet-NG is a nice local network redirector. When it’s running, a service called WinDivert xxx is in operation, so that’s one way to detect it. Others may include spotting boilerplate file content that is delivered on monitored ports: if an analyst forgot to edit these, the content returned by a local server is predictable and can be identified as a default FakeNet reply.

With regards to Wireshark, there are tons of ways to detect it. The filenames in the default install directory, the Registry entries, the NPCAP/WinPCAP driver/service, the window class/title, the file extensions it takes over, etc. Notably, newer versions of Wireshark also use QT, so you can look for a Qt5QWindowIcon class with the title The Wireshark Network Analyzer.
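The window class/title hints scattered through the paragraphs above boil down to a few string comparisons, which is also what makes them easy to turn into a signature or a sandbox behavioural tag. The sketch below only does the matching part, assuming the (class, title) pairs have already been collected elsewhere; guess_tool is a made-up name, and the strings are copied from the text rather than verified against every version of each tool.

```python
def guess_tool(window_class, title):
    """Return the tool suggested by a (window class, title) pair, or None."""
    cls = window_class.lower()
    text = title.lower()
    # Java AWT frame/dialog whose title starts with the program name.
    if cls in ("sunawtframe", "sunawtdialog") and text.startswith("ghidra"):
        return "Ghidra"
    # DiE gives itself away through the title alone.
    if "detect it easy" in text or text == "die":
        return "Detect It Easy"
    # Generic Qt5 top-level window: only the title can tell the tools apart.
    if cls == "qt5qwindowicon":
        if "wireshark" in text:
            return "Wireshark"
        if "x32dbg" in text or "x64dbg" in text:
            return "x32dbg/x64dbg"
        if ".idb" in text or "_i64" in text:
            return "IDA"
    return None


# Hypothetical project name in the title.
print(guess_tool("SunAwtFrame", "Ghidra: demo"))
```

Anything that is not on the list simply returns None, so the table can grow one tool at a time.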

Sysmon and EDRs are a completely different category. If you see them running, you need to rely more on Lolbins and/or other trickery (e.g. common whitelisting points, i.e. directories whitelisted by EDRs/analysts who often rely on ‘standard’ configs such as SwiftOnSecurity’s). There is a growing body of knowledge that focuses on bypassing EDRs and it’s just a matter of time before it becomes a de facto part of attackers’ toolkits. Bugs, clever bypasses, code patching etc. are on the rise. It’s also time we created a curated list of artifacts that EDR tools give away: program locations, processes running, Registry keys, services, etc.

For obvious reasons, I am not listing all artifacts, to make it a little bit harder for the bad guys, but these are potential detection capabilities out there and it’s good to keep them in mind. Not only can they help malware detect defenders’ tools, but they may also be useful for sandbox vendors / SOC analysts to identify sample behavioral traits.

I guess, the Cat and Mouse game continues?