The quirks of Batch Processing

August 4, 2019 in Batch Analysis, Clustering

Processing large corpora of samples is a very interesting engineering project. Once started, it never ends. There are always new files to process, there is always code to add. It’s a great way to learn about files, in general.

You typically start with a basic script that helps to recognize a file type of processed samples. A known file type helps us to easily organize files into clusters. Of course, as you look at more and more file types, you start recognizing patterns that help to add additional info. File types we once considered atomic have subtypes now, and even subcategories, or are tagged in many ways. As you progress, sooner or later you will find yourself writing a full-blown file parser and content analysis tool.

You will encounter many file types that are no longer used, or for which parsing tools only exist on a specific platform, or file types for which writing a dedicated parser would take a few months. You will start cutting corners by adapting your parser to parse an output generated by other parsers. You will add standard hashes, imphash, fuzzy hashes, you will then start applying them to different section of files. Previously ignored file sections will be researched, codified. You will scan with AV, multithread, multiOS, multihost. You will collect, correlate, enrich, present. Each file will become a graph of properties and correlations.

You parser will become a full-blown Frankenstein’s Monster.

And while I described it in a very generic way, I am going to list a couple of gotchas that you will come across coding / running this thing.

  • Beware of a system command
    • anytime you execute a separate program via any of many variations of system command you may end up executing a program that resides in a sample’s directory;
      Windows system command relies on a cmd.exe program that is often executed blindly w/o specifying a full path; if it happens that your script is operating from within a directory where there is a sample called cmd.exe, that sample will be executed!
  • Beware of the Ampersands
    • On Windows system, if a file name you are processing includes ampersands (pretty common if you wget some files when you do some web crawling), if you pass it to the shell w/o escaping the ampersands you may end up executing multiple processes you didn’t plan to. For example:
  • Timeouts
    • If you utilize external programs they will very often hang. You come back and check the progress 12h later and realize the whole process stopped waiting for that external parser to finish its work; except it just got stuck in some never-ending loop!
  • File Typing is actually very hard
    • File extension is meaningless, but may be your last resort
    • Your ZIP recognition algorithm will thank you a lot if you can pull any extra information from a zipped file
      • Is it a Java Archive?
      • Is it a Chrome Plugin?
      • Is it a backup of photos?
      • Is it a ZIP attachment (e.g. ZIPSFX)
      • Is it password protected? Do ‘default’ protection passwords work?
    • Polyglot files are a thing
    • Web files can be hard to identify as they include keywords from many other languages; plus obfuscation; parsing this thing statically is impossible w/o a headless browser tool
    • ANSI vs. Unicode vs UTF8 vs. many other character encodings
    • UUEncode vs. BASE64 vs. many other data encodings
    • Is it a plain text, binary?
    • Is it English, Russian, Greek, Chinese?
    • What does the entropy tell us? How does it change across file content?
    • Does the file refer to any files provided in the same archive? Is there any link?
    • Is it packed/protected? Can it be statically unprotected? Unpacked?
    • What does a file content tell us?
      • Strings
      • Unicode Strings
      • MBCS Strings
      • Compiler
      • Linker
      • Section table
      • Import table
      • Export Table
      • .NET Metadata
      • Is it signed?
      • Is it an installer?
      • Any appended data?
      • Can files / sections inside be extracted statically (decompressor exists) and/or dynamically (unattended or guided installation)
      • Can the installation script be reconstructed?
      • Can we extract embedded files? Bitmaps, Icons, Movies, Strings, etc.
      • Any luck with Yara sigs?
      • What about a disassembly? decompilation?
      • Any compiler-specific metadata? (e.g. in Delphi files)
    • Is it malicious?
    • Is it a quarantined file?
    • Is it corrupted?
      • Not only based on a file structure inconsistencies; you may come across binary files that have a specific file type and format, but are saved… as Unicode; it breaks the format, but if you can recognize the type of corruption you may try to undo it and parse the recovered file
    • Any anomalies observed? (e.g. high number of sections; section size longer than a file size)
    • etc.

If you happen to be writing a tool like this, remember that it’s easier today than it was 10-15 years ago. We have tones of tools to rely on and lots of code available. Antivirus scans are easy wins, but then you have projects like pefile module, 7z, Universal Extractor, Resource Hacker in a CLI mode, hachoir, headless browsers, disassemblers, decompilers, virtual machines, etc.. Combining it together is hard, but possible, given time 🙂 Despite providing a long wishlist / description above, I have not implemented all of these things yet. And I’ve been coding it for 15 years. I guess 15 more to come 🙂

Share this :)

Comments are closed.