This software has been discontinued. Please use HexDive (it has all HAPI features plus lots more).
Also, check our other tools.
In one of my previous posts (Extracting Strings from PE sections), I demonstrated (ya… right, what a big word) how easy it is to extract the sections of a PE file into separate files using 7-Zip so that they can later be used for targeted strings analysis. As I mentioned, splitting a file into sections can be really useful, as it helps to reduce the number of random string-like non-strings we see in the output of ‘strings’-type tools. Just to be on the safe side though, you may want to refer to my original post to find out more about the caveats of this approach, as there are cases when it may not be such a good idea.
There are many other techniques that can help in noise reduction and I am going to demonstrate one more today.
Analyzing Portable Executable (PE) files usually kicks off with running multiple static analysis tools, including ‘strings’ and other tools that can help in determining what APIs are being used by a sample. One can use tools like PEDump, LordPE, PETools, Stud_PE, Dependency Walker, and lots of others that process the sample’s import/export tables and help in guessing what specific functionality is embedded in the sample.
Now, before we proceed further – three warnings here.
- You should never, ever conclude your malware analysis with the output of ‘strings’ or PE parsing tools. This is the first step to shooting yourself in the foot. Always do code analysis. I will come back to this topic in the future in a separate post.
- Ensure you actually know how these PE tools work. I know I don’t need to say this, but I once saw a person using the Dependency Walker tool and analyzing a malicious file by looking at the full list of functions exported from one of the Operating System DLLs. The DLL was not even linked directly to the malware; it was referenced only by another DLL that was directly linked to the malicious .exe. In other words, sample.exe was linked to kernel32.dll, and kernel32.dll links to ntdll.dll. The guy was looking at the pane listing all functions exported by ntdll.dll. And while he was right that ntdll.dll does contain a lot of APIs typically used by malware, he was completely off track! Oh, boy…
- Obviously, APIs can often be found outside the import table, since many packers, protectors and wrappers move them from import tables to internal data structures. They are often visible only when the memory of the protected process is dumped to a file; thus, none of the typical PE parsing tools can ‘see’ them.
So, now back to the original topic.
One simple noise reduction technique that is well known and used by many analysts is based on lists of patterns; these can be keywords, ANSI or Unicode strings, regular expressions, and practically speaking, any string of bytes that is unique and can be helpful in identifying interesting stuff inside the samples. This technique is used to some extent by projects like Yara and PEiD, and of course, it is extensively used by antivirus and IDS software. Having a good pattern list that identifies a certain class of artifacts inside a file is a very attractive idea, and I must confess that I have been using such lists myself for a number of years.
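To make the idea concrete, here is a minimal sketch of a pattern-list search over raw bytes. It is only an illustration, not the code of any of the tools mentioned: the function name `find_patterns`, the two-keyword list and the tiny `sample` buffer are all made up for the example, and "Unicode" is assumed to mean UTF-16LE, as is usual for Windows binaries.

```python
def find_patterns(data: bytes, keywords: list[str]) -> list[tuple[int, str, str]]:
    """Return (offset, keyword, encoding) for every hit, sorted by offset."""
    hits = []
    for word in keywords:
        # Look for each keyword both as an ANSI and as a UTF-16LE string.
        for label, codec in (("ansi", "ascii"), ("unicode", "utf-16-le")):
            needle = word.encode(codec)
            pos = data.find(needle)
            while pos != -1:
                hits.append((pos, word, label))
                pos = data.find(needle, pos + 1)
    return sorted(hits)


# Hypothetical buffer standing in for a file or a memory dump.
sample = b"\x00\x01CreateFileA\x00junk" + "WinExec".encode("utf-16-le") + b"\x00\x00"

for offset, word, label in find_patterns(sample, ["CreateFileA", "WinExec"]):
    print(f"{offset:#06x} {label:8} {word}")
```

Note that this walks the whole buffer once per keyword and per encoding, which is fine for a handful of patterns only.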
After thinking one day about how to improve the typical ‘strings’ analysis process, I cooked up a little program that focuses on one class of such patterns: APIs.
First, I built a list of over 50,000 clean APIs, including:
- Windows API
- native APIs
- kernel mode APIs
All of these are exported and imported by native Windows programs, drivers and DLLs. I combined them together into a large list. I then created a program that uses this list and searches for all of these inside the analyzed binary (note again: I run it most of the time on memory dumps, since many malicious samples come protected).
Yup. It’s that simple.
Now, you may be thinking: searching for 10-15 strings using a naive method (i.e. walking through the whole data 10-15 times, once for each string, or even using one regular expression) works well, but for 50,000 or more strings we probably need to do better.
You are right.
This is a non-trivial problem, and a naive algorithm doesn’t work here. Luckily, there are smart people out there who already figured it out. I looked around and researched various multi-pattern search algorithms, eventually deciding to use a very well-known one: Aho-Corasick. It uses a very clever method of finding patterns by walking a trie each time a new character is fetched from the input, so it can search for a large set of patterns simultaneously (well, it’s more complicated than that, but let’s say it is very fast even for 50k patterns).
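For the curious, the core of Aho-Corasick fits in a few dozen lines. The following is a textbook-style sketch and not the code used in the tool; the class name `AhoCorasick`, the four patterns and the test buffer are assumptions made for the example. The input is read exactly once, and a failure link is followed whenever the current trie node has no edge for the next byte.

```python
from collections import deque


class AhoCorasick:
    def __init__(self, patterns):
        # Node i is described by goto[i] (byte -> child), fail[i] and out[i].
        self.goto, self.fail, self.out = [{}], [0], [[]]
        for pattern in patterns:
            self._add(pattern)
        self._link()

    def _add(self, pattern):
        """Insert one pattern (bytes) into the trie."""
        node = 0
        for byte in pattern:
            nxt = self.goto[node].get(byte)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[node][byte] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            node = nxt
        self.out[node].append(pattern)

    def _link(self):
        """Compute failure links breadth-first, shallow nodes first."""
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for byte, child in self.goto[node].items():
                f = self.fail[node]
                while f and byte not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(byte, 0)
                # Inherit patterns that end at the failure target.
                self.out[child] += self.out[self.fail[child]]
                queue.append(child)

    def search(self, data):
        """Yield (offset, pattern) for every match in a single pass."""
        node = 0
        for pos, byte in enumerate(data):
            while node and byte not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(byte, 0)
            for pattern in self.out[node]:
                yield pos - len(pattern) + 1, pattern


patterns = [b"CreateFile", b"CreateFileA", b"FileA", b"WinExec"]
data = b"\x00\x00CreateFileA\x00\x00WinExec"
hits = list(AhoCorasick(patterns).search(data))
for offset, pattern in hits:
    print(offset, pattern.decode())
```

Overlapping names such as CreateFile, CreateFileA and FileA are all reported, even though the buffer is scanned only once.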
Since building the search trie that the Aho-Corasick algorithm relies on takes quite some time, I precompiled it and embedded it directly in the executable. So, here it is: a simple tool that extracts known API names from a given binary.
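The build-once, load-many idea can be sketched as follows. This is not how the tool itself stores its trie (that format is not described here); it simply assumes Python's standard `pickle` module as the serialization step, and `build_trie` is a hypothetical stand-in for the expensive build.

```python
import pickle


def build_trie(names):
    """Stand-in for the expensive build step: a nested-dict trie of API names."""
    root = {}
    for name in names:
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node["$"] = name  # end-of-name marker; assumes names never contain '$'
    return root


# Build step (run once, offline): serialize the finished structure.
blob = pickle.dumps(build_trie(["CreateFileA", "CreateFileW", "WinExec"]))

# Run step (every start of the tool): load the blob instead of rebuilding.
trie = pickle.loads(blob)
print(len(blob), "bytes of precompiled trie")
```

In a real tool the blob would be written to disk or linked into the binary as a resource, so the cost of building is paid only at release time.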
I hope you will find it useful.
Used on a random malicious sample, it produces the following results:
HAPI v0.1 (c) Hexacorn 2012. All rights reserved.
Visit us at http://www.hexacorn.com
Okay, it’s not random. It’s the same one I used to demonstrate Anti-forensics – live examples.