
Hunting for additional PE timestamps

January 4, 2019 in Batch Analysis, Clustering, File Formats ZOO, Malware Analysis, Reversing

Over the years I have published a number of posts about file timestamps. Not the file system timestamps, but timestamps hidden inside the actual file content.

I wrote that there is a way to ‘heuristically’ carve timestamps from binary files. I have also provided some compilation timestamp stats for PE files, and discussed the less-known Java folder timestamps. And when we talk about PE files specifically, there is a PE compilation timestamp discussion in the context of timestomping. There is also a bit of rambling about the infamous Borland/Delphi resource timestamp.

Anyone who ever looked at the PE file specification or various PE Dump reports knows there are possibly more timestamps hidden inside these executable files.

For starters, we can look for timestamps sometimes present in additional PE file areas, e.g. hidden inside a debug section, or even the file’s signature – if they exist. There is information about the compiler/linker version, the Rich Header, the .NET version, imported and exported APIs, imported DLLs, etc. All of it may help to narrow down the timeframe when the file was created. The DiE tool does a good job of helping extract some of this information.
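Many of these extra timestamps can be pulled out programmatically. A minimal sketch, assuming the pefile Python library, that dumps the additional TimeDateStamp fields sitting in the debug, export, and resource directories (all of them Unix timestamps, with zero meaning ‘not set’):

import datetime
import pefile

def extra_timestamps(path):
    pe = pefile.PE(path)
    # the classic IMAGE_FILE_HEADER compilation timestamp, for reference
    found = {'header': pe.FILE_HEADER.TimeDateStamp}
    # each debug directory entry carries its own TimeDateStamp
    for entry in getattr(pe, 'DIRECTORY_ENTRY_DEBUG', []):
        found.setdefault('debug', entry.struct.TimeDateStamp)
    # so do the export and resource directories
    if hasattr(pe, 'DIRECTORY_ENTRY_EXPORT'):
        found['export'] = pe.DIRECTORY_ENTRY_EXPORT.struct.TimeDateStamp
    if hasattr(pe, 'DIRECTORY_ENTRY_RESOURCE'):
        found['resource'] = pe.DIRECTORY_ENTRY_RESOURCE.struct.TimeDateStamp
    # drop the zeroed ones and convert the rest to readable dates
    return {k: datetime.datetime.utcfromtimestamp(v)
            for k, v in found.items() if v}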

We can do a lot of guesswork based on the information available inside the metadata as well, for example by simply looking at the file’s Version information block, or its manifest. Then there are good old strings: if we are lucky, the unencrypted strings embedded in the main file/configuration/update can give us an immediate answer. We may also rely on indirect references to time, e.g. versions of statically compiled libraries (sometimes you can see actual version strings), sometimes bugs in the code, and debug/verbosity logs that have not been stripped from the release version; sometimes dates are included in the PDB string itself, and then sometimes there is never-shown usage info, or even dead code that may include a dated ‘copyright’ note from the malicious author. These may also be included as comments inside scripts, in case the binary file is a host to interpreted code (this happens pretty often with older malware based on AHK, VBE, etc.). Advanced malware analysts can often deduce the version / rough time period of protector layers just by looking at the code (they can, because they write decrypters for this mess and often adjust their code, even on a daily basis).

Now, all of these can be modified or faked, precisely because they are so well-known. But there are more timestamps we can look at.

When we read the PE file documentation we can notice that it is rich in descriptions of all these additional timestamp fields available.

The problem is… most of the time they are zeroed.

But…

It is actually not always the case!

I was reading the description of the VS_FIXEDFILEINFO structure, and I realized that I never seriously looked at the two timestamp fields it includes:

dwFileDateMS Type: DWORD

The most significant 32 bits of the file’s 64-bit binary creation date and time stamp.

dwFileDateLS Type: DWORD

The least significant 32 bits of the file’s 64-bit binary creation date and time stamp.

So, I wrote a quick parser to search my PE metadata logs for samples where the values of these fields are not zero – and, if not zero, fall within a certain, reasonable timestamp range (between January 1st, 2000 and today).

To my surprise, the script started spitting out names of samples that had these fields populated. Very often the values would be identical to the PE compilation timestamp, but in some cases they would be off by a few minutes, sometimes by days or months. Since such cases provide a range between the version info timestamp and the actual PE file compilation, they can give us additional information about the timeframe of the sample’s active development.
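My parser worked over pre-generated metadata logs, but the same check can be run directly against a PE file. A minimal sketch, assuming the pefile library and treating the 64-bit value as a Windows FILETIME (an assumption – the documentation only calls it a ‘binary date and time stamp’):

import datetime
import pefile

FILETIME_EPOCH = datetime.datetime(1601, 1, 1)
LOW, HIGH = datetime.datetime(2000, 1, 1), datetime.datetime.now()

def version_file_date(path):
    pe = pefile.PE(path)
    ffi = getattr(pe, 'VS_FIXEDFILEINFO', None)
    if not ffi:
        return None
    ffi = ffi[0] if isinstance(ffi, list) else ffi  # a list in newer pefile versions
    raw = (ffi.FileDateMS << 32) | ffi.FileDateLS
    if raw == 0:
        return None
    # assumption: FILETIME, i.e. 100-nanosecond intervals since 1601-01-01
    ts = FILETIME_EPOCH + datetime.timedelta(microseconds=raw // 10)
    if not (LOW <= ts <= HIGH):
        return None
    compiled = datetime.datetime.utcfromtimestamp(pe.FILE_HEADER.TimeDateStamp)
    return ts, compiled, ts - compiled  # the delta hints at the development timeframe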

An example of such a file can be found here.

An obvious question appears: which compilers/linkers produce executables with these fields populated?

I don’t know at this stage. Also, despite some good results, I need to emphasize that most samples do not include a valid timestamp here. Still… when it’s available… why shouldn’t we be extracting it?

What can you do with 250K sandbox reports?

January 20, 2018 in Batch Analysis, Clustering, Malware Analysis, Sandboxing

I was recently asked about the data I released for the New Year celebration. The question was: okay, what can I do with all this alleged goodness?

Well…

For starters, this is the first time (at least to my knowledge) that someone has dumped 250K reports of sandboxed samples. The reports are not perfect, but they can help you understand the execution flow of many malware (and, in more general terms, software) samples.

What does it mean in practice?

Let’s have a look…

Say you want to see all the possible driver names that these 250K reports include.

Why would you need that?

This could tell you what anti-analysis tricks malware samples use, what device names are used to fingerprint the OS. Perhaps some of these devices are not even documented yet!

grep -E '\\\\\.\\\\' Sandbox_250k_logs_Happy_New_Year_2018

gives you a long list of matching rows.

You can play around with the output, but writing a perl/python script to extract these is probably a better idea.
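A hedged sketch of such a script (the log file name comes from the grep above; the \\.\ device-path pattern is the only assumption made about the line format):

import collections
import re

DEVICE_RE = re.compile(r'\\\\\.\\\w+')  # matches \\.\SomeDeviceName

counts = collections.Counter()
with open('Sandbox_250k_logs_Happy_New_Year_2018',
          encoding='utf-8', errors='replace') as f:
    for line in f:
        counts.update(DEVICE_RE.findall(line))

# print the unique device names, most frequent first
for name, n in counts.most_common():
    print(f'{n:8d}  {name}')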

Okay, what about the most popular function resolved using the GetProcAddress API?

Something like this could help:

grep -iE API::GetProcAddress Sandbox_250k_logs_Happy_New_Year_2018 |
cut -d: -f3 | cut -d, -f2 | cut -d= -f2 | cut -d')' -f1

This will give you a list of all the resolved APIs.

We can save the result to a file by redirecting the output of that command to e.g. ‘gpa.txt’:

grep -iE API::GetProcAddress Sandbox_250k_logs_Happy_New_Year_2018 |
cut -d: -f3 | cut -d, -f2 | cut -d= -f2 | cut -d')' -f1 > gpa.txt

This will take a while.

You can now sort it:

sort gpa.txt > gpa.txt.s

The resulting file gpa.txt.s can then be analyzed further – e.g. counting the occurrences of each API, then sorting the counts in descending order to show the most popular APIs:

cat gpa.txt.s | uniq -c | sort -rn | more

All the above commands could be combined into one single ‘caterpillar’, but using intermediate files is sometimes handy. It facilitates further searches later on… It also speeds things up.

Coming back to our last query, we could ask for all APIs that include a ‘Reg’ prefix/infix/suffix – this can give us a rough idea of which Registry APIs are resolved most frequently:

cat gpa.txt.s | uniq -c | sort -rn | grep -E "Reg" | more

How would you interpret the results?

There are some FPs there, e.g. GetThemeBackgroundRegion, but it’s not a big deal. ANSI APIs (those with an ‘A’ at the end) are still more popular than the Unicode ones (more precisely, the ‘Wide’ ones, with a ‘W’ at the end). Or… the dataset we have at hand is biased towards older samples that were compiled without Unicode in mind. So… be careful… any interpretation here is really quite biased.

But see? This is all an open book!

Again, I want to emphasize that all these searches can be done in many ways. It’s also possible you will find some flaws in my queries. That’s OK. This is data for playing around with!

Now, imagine you want to see all the DELPHI APIs that we intercepted:

grep -iE Delphi:: Sandbox_250k_logs_Happy_New_Year_2018 | more

or, all inline functions from Visual C++:

grep -iE VC:: Sandbox_250k_logs_Happy_New_Year_2018 | more

or, all the rows with ‘http://’ in them (highlighting possible URLs):

grep -iE http:// Sandbox_250k_logs_Happy_New_Year_2018 | more

You can also see what debug strings samples send:

grep -iE API::OutputDebugString Sandbox_250k_logs_Happy_New_Year_2018 | more

You can check what values are used by the Sleep functions:

grep -iE API::Sleep Sandbox_250k_logs_Happy_New_Year_2018 | more
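If the Sleep lines carry the millisecond argument in a parseable form – an assumption about the log format, so the regex below is a guess to adjust – a short script can tally the most common delays:

import collections
import re

# hypothetical format: the first number after 'API::Sleep' is the delay in ms
SLEEP_RE = re.compile(r'API::Sleep\D*(\d+)')

delays = collections.Counter()
with open('Sandbox_250k_logs_Happy_New_Year_2018',
          encoding='utf-8', errors='replace') as f:
    for line in f:
        m = SLEEP_RE.search(line)
        if m:
            delays[int(m.group(1))] += 1

# unusually long delays often hint at sandbox-evasion sleeps
for ms, n in delays.most_common(20):
    print(f'{n:8d}  {ms} ms')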

You can also check which windows are searched for by anti-AV/anti-analysis tools:

grep -iE API::FindWindow Sandbox_250k_logs_Happy_New_Year_2018 | more

etc. etc.

The sky is the limit.

You can look at the beginning of every single sample and identify the ‘dynamic’ flow of the WinMain procedure for many different compilers. You can discover various environment variables used by different samples, cluster APIs from specific libraries, and observe techniques like process hollowing. You can look at the distribution of WriteProcessMemory to understand how many samples use RunPE for code injection and execution, and how many rely on position-independent code (PIC). You can see which startup points are the most frequently used (it’s not always HKCU\…\Run!), what mechanisms are used to launch code in a foreign process (e.g. RtlCreateUserThread, APC functions), how many processes are suspended before code is injected into them (CREATE_SUSPENDED), etc. etc.
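As a small example of the RunPE idea, here is a sketch that counts how often the APIs from a classic RunPE/process-hollowing chain appear in the logs (raw line counts, not per-sample counts – nothing is assumed here about how individual reports are delimited):

import collections
import re

INJECTION_APIS = ('CreateProcess', 'NtUnmapViewOfSection', 'VirtualAllocEx',
                  'WriteProcessMemory', 'SetThreadContext', 'ResumeThread',
                  'RtlCreateUserThread', 'QueueUserAPC')
API_RE = re.compile('API::(' + '|'.join(INJECTION_APIS) + ')')

hits = collections.Counter()
with open('Sandbox_250k_logs_Happy_New_Year_2018',
          encoding='utf-8', errors='replace') as f:
    for line in f:
        m = API_RE.search(line)
        if m:
            hits[m.group(1)] += 1

# the relative frequencies hint at how common each injection step is
for api, n in hits.most_common():
    print(f'{n:8d}  {api}')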

Again… this data can remain dead data, or you can bring it to life by being creative and mining it in every possible way…

If you have any questions feel free to DM me on Twitter, or ping me directly via email.

Note: commercial use of this data is prohibited; I only mention it because not only is it most likely tempting to abuse it, but you may actually be better off using a different data set. If you want to use it commercially, I can provide you with 1.6M Unicode-based reports for analysis, with more details included. Get in touch to find out more 🙂