
The quirks of Batch Processing

August 4, 2019 in Batch Analysis, Clustering

Processing large corpora of samples is a very interesting engineering project. Once started, it never ends. There are always new files to process, there is always code to add. It’s a great way to learn about files, in general.

You typically start with a basic script that helps to recognize a file type of processed samples. A known file type helps us to easily organize files into clusters. Of course, as you look at more and more file types, you start recognizing patterns that help to add additional info. File types we once considered atomic have subtypes now, and even subcategories, or are tagged in many ways. As you progress, sooner or later you will find yourself writing a full-blown file parser and content analysis tool.

You will encounter many file types that are no longer used, or for which parsing tools only exist on a specific platform, or for which writing a dedicated parser would take a few months. You will start cutting corners by adapting your parser to parse the output generated by other parsers. You will add standard hashes, imphash, and fuzzy hashes, and then start applying them to different sections of files. Previously ignored file sections will be researched and codified. You will scan with AV, multithread, multiOS, multihost. You will collect, correlate, enrich, present. Each file will become a graph of properties and correlations.

Your parser will become a full-blown Frankenstein’s Monster.

And while I described it in a very generic way, I am going to list a couple of gotchas that you will come across while coding / running this thing.

  • Beware of a system command
    • Anytime you execute a separate program via any of the many variations of the system command, you may end up executing a program that resides in a sample’s directory.
      The Windows system command relies on cmd.exe, which is often executed blindly w/o specifying a full path; if your script happens to be operating from within a directory that contains a sample called cmd.exe, that sample will be executed!
  • Beware of the Ampersands
    • On Windows, if a file name you are processing includes ampersands (pretty common if you wget files while web crawling) and you pass it to the shell w/o escaping them, you may end up executing multiple processes you didn’t plan to.
  • Timeouts
    • If you utilize external programs they will very often hang. You come back and check the progress 12h later and realize the whole process stopped waiting for that external parser to finish its work; except it just got stuck in some never-ending loop!
  • File Typing is actually very hard
    • File extension is meaningless, but may be your last resort
    • Your ZIP recognition algorithm will thank you a lot if you can pull any extra information from a zipped file
      • Is it a Java Archive?
      • Is it a Chrome Plugin?
      • Is it a backup of photos?
      • Is it a ZIP attachment (e.g. ZIPSFX)?
      • Is it password protected? Do ‘default’ protection passwords work?
    • Polyglot files are a thing
    • Web files can be hard to identify as they include keywords from many other languages; plus obfuscation; parsing this thing statically is impossible w/o a headless browser tool
    • ANSI vs. Unicode vs UTF8 vs. many other character encodings
    • UUEncode vs. BASE64 vs. many other data encodings
    • Is it plain text or binary?
    • Is it English, Russian, Greek, Chinese?
    • What does the entropy tell us? How does it change across file content?
    • Does the file refer to any files provided in the same archive? Is there any link?
    • Is it packed/protected? Can it be statically unprotected? Unpacked?
    • What does a file content tell us?
      • Strings
      • Unicode Strings
      • MBCS Strings
      • Compiler
      • Linker
      • Section table
      • Import table
      • Export Table
      • .NET Metadata
      • Is it signed?
      • Is it an installer?
      • Any appended data?
      • Can files / sections inside be extracted statically (decompressor exists) and/or dynamically (unattended or guided installation)
      • Can the installation script be reconstructed?
      • Can we extract embedded files? Bitmaps, Icons, Movies, Strings, etc.
      • Any luck with Yara sigs?
      • What about a disassembly? decompilation?
      • Any compiler-specific metadata? (e.g. in Delphi files)
    • Is it malicious?
    • Is it a quarantined file?
    • Is it corrupted?
      • Not only based on file structure inconsistencies; you may come across binary files that have a specific file type and format, but are saved… as Unicode; it breaks the format, but if you can recognize the type of corruption you may try to undo it and parse the recovered file
    • Any anomalies observed? (e.g. a high number of sections; a section size larger than the file size)
    • etc.
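
The first three gotchas (system command, ampersands, timeouts) share a common mitigation: skip the shell entirely, call the tool by its absolute path, and always set a timeout. Below is a minimal Python sketch of that idea, not a definitive implementation. The helper name `run_tool`, its return values and the 60-second default are my assumptions for illustration, and it presumes the tool is a regular executable rather than a batch file.

```python
import os
import subprocess
import sys

def run_tool(tool_path, args, timeout_s=60):
    """Run an external parser without involving a shell.

    Returns (status, stdout_bytes); status is 'ok', 'failed' or 'timeout'.
    """
    # Insist on an absolute path, so a sample that happens to be named like
    # the tool and sits in the current directory is never picked up instead.
    if not os.path.isabs(tool_path):
        raise ValueError("tool_path must be absolute: %r" % (tool_path,))
    try:
        # An argument list with the default shell=False hands each file name
        # to the tool as a single argument; '&' has no special meaning here.
        proc = subprocess.run(
            [tool_path] + list(args),
            stdin=subprocess.DEVNULL,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising, so one hung
        # parser cannot stall the whole batch.
        return "timeout", b""
    return ("ok" if proc.returncode == 0 else "failed"), proc.stdout

if __name__ == "__main__":
    # The Python interpreter stands in for a real parser in this demo.
    print(run_tool(sys.executable, ["-c", "print('hello')"]))
```

Recording the `'timeout'` status next to the sample, instead of just skipping it, makes it easy to retry those files later with a different tool.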

If you happen to be writing a tool like this, remember that it’s easier today than it was 10-15 years ago. We have tons of tools to rely on and lots of code available. Antivirus scans are easy wins, but then you have projects like the pefile module, 7z, Universal Extractor, Resource Hacker in CLI mode, hachoir, headless browsers, disassemblers, decompilers, virtual machines, etc. Combining it all together is hard, but possible, given time 🙂 Despite providing a long wishlist / description above, I have not implemented all of these things yet. And I’ve been coding it for 15 years. I guess 15 more to come 🙂
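
Even before reaching for those tools, a couple of the checks from the wishlist need nothing beyond the Python standard library. The sketch below covers two of them: guessing what a ZIP container really holds, and measuring how entropy changes across a file. The function names, the marker file names used for the guesses and the 4096-byte window are assumptions chosen for illustration; real-world typing needs many more rules than this.

```python
import io
import math
import zipfile
from collections import Counter

def shannon_entropy(data):
    """Shannon entropy of a bytes object, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(data).values())

def entropy_profile(data, window=4096):
    """Entropy of consecutive windows, to see how it changes across a file."""
    return [shannon_entropy(data[i:i + window])
            for i in range(0, len(data), window)]

def zip_subtype(data):
    """Guess the ZIP flavour from member names; returns (kind, encrypted)."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
        # Bit 0 of the general purpose flags marks an encrypted member.
        encrypted = any(info.flag_bits & 0x1 for info in zf.infolist())
    if "META-INF/MANIFEST.MF" in names:
        kind = "java-archive"
    elif "manifest.json" in names:
        kind = "browser-extension?"   # heuristic only, hence the '?'
    elif "[Content_Types].xml" in names:
        kind = "ooxml-document"
    else:
        kind = "zip"
    return kind, encrypted
```

A sudden jump in the entropy profile towards 8 bits per byte is a hint (not proof) of compressed or encrypted content, which is often where appended data or packed sections start.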

Finding good keywords

July 21, 2019 in Batch Analysis, Clustering, Malware Analysis

When we analyze malware, one of the best tools we have at our disposal is signatures (in any form, doesn’t need to be yara or peid). Generating signatures is an art, an art that takes a lot of human cycles, and it would be interesting to automate this process. While binary signatures are kinda hard to extract automatically (you typically need to eyeball the code first), the ANSI/Unicode strings are pretty straightforward.

In this post I will run through a couple of strategies one can use to generate good keyword sets for malware research.

The first thing we can do is to focus on whitelisting. If we just run strings over the whole clean OS install we will end up with a long list of strings that you should not include in your production signatures. Once you have such a keyword list, you could run strings on a malware sample and exclude all the whitelisted strings from the output.
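
As a rough illustration of this first strategy, here is a minimal Python sketch: a poor man's strings that pulls ANSI and UTF-16LE strings out of raw bytes, a whitelist built from clean files, and a set difference applied to a sample. The function names and the minimum length of 5 characters are assumptions of mine, not part of any existing tool.

```python
import re

# Runs of 5+ printable ASCII characters, stored either as single bytes
# (ANSI) or as UTF-16LE code units (each character followed by a zero byte).
ASCII_RE = re.compile(rb"[\x20-\x7e]{5,}")
UTF16_RE = re.compile(rb"(?:[\x20-\x7e]\x00){5,}")

def extract_strings(data):
    """Return the set of ANSI and UTF-16LE strings found in a bytes object."""
    found = {m.group().decode("ascii") for m in ASCII_RE.finditer(data)}
    found |= {m.group().decode("utf-16-le") for m in UTF16_RE.finditer(data)}
    return found

def build_whitelist(clean_blobs):
    """Union of all strings seen in the contents of known-clean files."""
    whitelist = set()
    for blob in clean_blobs:
        whitelist |= extract_strings(blob)
    return whitelist

def unseen_strings(sample, whitelist):
    """Strings from the sample that never appeared in the clean set."""
    return extract_strings(sample) - whitelist
```

In practice `clean_blobs` would be a generator reading every file of the clean install from disk, and the whitelist would live in a database rather than in memory.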

Of course, this is not a very good idea, because the ‘good’ strings will include lots of important stuff we actually want to see, e.g. API names. The list will also be quite bulky, because every file, including non-PE files, will add tons of strings. All these useless keywords will affect performance.

To optimize this process we can narrow down our focus to PE files only. We can also try to be a little more specific — we can add some context to each string. This is a bit more time consuming, because we need to write a tool that can provide string metadata. For example, a tool like can help — it parses PE files and extracts strings in a context of each PE file section.

Okay, so now we have a list of all good strings, with a context.
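
To make the idea of context concrete, here is a small Python sketch in which the whitelist is keyed by section name. It assumes that some PE parser has already split each file into a mapping of section name to raw bytes; that input shape and all function names are hypothetical.

```python
import re
from collections import defaultdict

STRING_RE = re.compile(rb"[\x20-\x7e]{5,}")

def strings_by_section(sections):
    """Map each section name to the set of ASCII strings found in its bytes."""
    return {
        name: {m.group().decode("ascii") for m in STRING_RE.finditer(data)}
        for name, data in sections.items()
    }

def add_to_whitelist(whitelist, sections):
    """Record every (section, string) pair seen in a known-clean file."""
    for name, strings in strings_by_section(sections).items():
        whitelist[name] |= strings

def surgical_filter(whitelist, sections):
    """Drop a string only if it was whitelisted for that same section."""
    return {
        name: strings - whitelist.get(name, set())
        for name, strings in strings_by_section(sections).items()
    }

if __name__ == "__main__":
    wl = defaultdict(set)
    add_to_whitelist(wl, {".idata": b"\x00LoadLibraryA\x00"})
    print(surgical_filter(wl, {".idata": b"LoadLibraryA\x00",
                               ".data": b"LoadLibraryA\x00keylog.dat\x00"}))
```

With this shape, an API name sitting in the import table of a clean file does not hide the same API name when it shows up in a sample's `.data` section, where it may well be worth a look.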

This is good for exclusions, but not for inclusions. If we want to write e.g. a yara signature, or our own signature scanner, we need to find the juicy stuff.

How to do it?

In my experimental string extraction tool I introduced a concept of string islands. It exploits the fact that important strings are typically grouped together, in close proximity to each other, inside the samples. This holds both for genuine, legitimate software and for malware. The easiest examples where this principle works are PE file resources: most resource strings are obviously grouped together in this place. Import tables and export tables follow the same principle. And depending on the compiler, we will often find many strings used by the program in a very specific place in the file (e.g. the .data section).
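
One possible way to express the island idea in code is to group strings by the distance between them. The Python sketch below is my own reading of the concept, not the actual algorithm of the tool mentioned above; the gap of 16 bytes and the minimum of 3 strings per island are arbitrary assumptions.

```python
import re

STRING_RE = re.compile(rb"[\x20-\x7e]{4,}")

def find_islands(data, max_gap=16, min_strings=3):
    """Group strings separated by at most max_gap bytes into islands.

    Returns a list of islands; each island is a list of (offset, string).
    """
    islands, current, prev_end = [], [], None
    for m in STRING_RE.finditer(data):
        # A large gap since the previous string closes the current island.
        if prev_end is not None and m.start() - prev_end > max_gap:
            if len(current) >= min_strings:
                islands.append(current)
            current = []
        current.append((m.start(), m.group().decode("ascii")))
        prev_end = m.end()
    # Do not forget the island that runs up to the end of the data.
    if len(current) >= min_strings:
        islands.append(current)
    return islands
```

Isolated strings, which are often just random bytes that happen to be printable, fall out naturally because they never reach `min_strings`.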

So… finding new keywords that could indicate the file as malicious can start with parsing a file, extracting its sections, extracting islands within each section, extracting strings within each island, and then using a short list of ‘needle’ keywords to determine if that specific island is ‘interesting’. We can use whitelisted strings as an exclusion as well (also, if we have the context, e.g. section where they come from, we can use surgical exclusion applied only to matching sections).
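
The last two steps of that pipeline (needle matching and whitelist exclusion) could look roughly like the Python sketch below. It assumes the islands have already been extracted and are passed in as plain lists of strings; the function name, the case-insensitive substring match and the `min_hits` threshold are my assumptions.

```python
def interesting_islands(islands, needles, whitelist=frozenset(), min_hits=1):
    """Keep islands that contain needle keywords.

    islands   - iterable of lists of strings (one list per island)
    needles   - lowercase keywords that mark an island as worth a look
    whitelist - strings known from clean files, dropped from the output
    Returns the selected islands with whitelisted strings removed.
    """
    selected = []
    for island in islands:
        # Count strings that contain at least one needle (case-insensitive).
        hits = sum(1 for s in island if any(n in s.lower() for n in needles))
        if hits >= min_hits:
            selected.append([s for s in island if s not in whitelist])
    return selected

if __name__ == "__main__":
    demo = [
        ["Software\\Mozilla\\Firefox", "signons.sqlite", "kernel32.dll"],
        ["This program cannot be run in DOS mode"],
    ]
    print(interesting_islands(demo, {"signons"}, whitelist={"kernel32.dll"}))
```

Note that the needle test runs on the whole island before the whitelist is applied, so a whitelisted string can still help an island qualify without ending up in the output.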

Now we have a very rich data set to work with. We excluded tons of non-interesting strings. We can do some stats, add interesting keywords back to the ‘needles’ pool and repeat the process. After a few iterations you will observe a nice pattern emerging and your keyword list will quickly improve.
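
The ‘do some stats’ step can be as simple as counting in how many interesting islands each string appears. The Python sketch below promotes frequent strings to the needle pool automatically, which is a simplification of mine; a human review of the candidates before promotion is probably wiser. The function name and the `min_islands` threshold are hypothetical.

```python
from collections import Counter

def promote_needles(interesting, needles, min_islands=2):
    """Return the needle pool extended with frequently seen strings.

    interesting - lists of strings, one list per interesting island
    needles     - current set of lowercase needle keywords
    """
    counts = Counter()
    for island in interesting:
        # Use a set so a string repeated inside one island counts once.
        counts.update({s.lower() for s in island})
    promoted = {s for s, n in counts.items() if n >= min_islands}
    return set(needles) | promoted
```

Feeding the returned set back into the island selection and running the whole thing again gives the iteration described above.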

Using this principle I extracted thousands of very useful keywords and artifacts. The most attractive findings are islands where we can find clusters of keywords belonging to a single category. This helps to classify them on the spot.

An example of a ‘good’ island found this way is shown below. By the look of it, it’s a typical infostealer. Using its built-in strings identifying targeted programs / passwords we can collect lots of juicy keywords in one go. These can make it directly to automatically generated yara rules & of course our ‘needle’ pool.