What can you do with 250K sandbox reports?
January 20, 2018 in Batch Analysis, Clustering, Malware Analysis, Sandboxing
I was recently asked about the data I released for the New Year celebration. The question was: okay, what can I do with all this alleged goodness?
Well…
For starters, this is the first time (at least to my knowledge) someone dumped 250K reports of sandboxed samples. The reports are not perfect, but can help you to understand the execution flow for many malware (and in more general terms: software) samples.
What does it mean in practice?
Let’s have a look…
Say you want to see all the possible driver names that these 250k include.
Why would you need that?
This could tell you what anti-analysis tricks malware samples use, what device names are used to fingerprint the OS. Perhaps some of these devices are not even documented yet!
grep -iE '\\\\\.\\' Sandbox_250k_logs_Happy_New_Year_2018
gives you this:
You can play around with the output, but writing a Perl/Python script to extract these is probably a better idea.
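As a rough illustration of such a script, here is a minimal Python sketch. It assumes device paths appear in the logs verbatim as \\.\Name; the function names and the set of characters allowed in a device name are my own guesses for illustration, not something the dataset documents.

```python
import re
from collections import Counter

# Matches a device path such as \\.\SomeDevice and captures the name part.
# The allowed name characters are an assumption; widen the class if needed.
DEVICE_RE = re.compile(r"\\\\\.\\([A-Za-z0-9_$#{}-]+)")


def count_device_names(lines):
    """Count device names (lower-cased) found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for name in DEVICE_RE.findall(line):
            counts[name.lower()] += 1
    return counts


def count_device_names_in_file(path):
    """Run the counter over a log file, replacing undecodable bytes."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_device_names(handle)
```

Calling count_device_names_in_file("Sandbox_250k_logs_Happy_New_Year_2018").most_common() would then list the device names from most to least frequent, which is easier to skim than raw grep output.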
Okay, what about the most popular function resolved using the GetProcAddress API?
Something like this could help:
grep -iE API::GetProcAddress Sandbox_250k_logs_Happy_New_Year_2018 | cut -d: -f3 | cut -d, -f2 | cut -d= -f2 | cut -d')' -f1
This will give you a list of all APIs:
We can save the result to a file by redirecting the output of that command to e.g. ‘gpa.txt’:
grep -iE API::GetProcAddress Sandbox_250k_logs_Happy_New_Year_2018 | cut -d: -f3 | cut -d, -f2 | cut -d= -f2 | cut -d')' -f1 > gpa.txt
This will take a while.
You can now sort it:
sort gpa.txt > gpa.txt.s
The resulting file gpa.txt.s can then be analyzed further: count how many times each API occurs, then sort the counts in descending order to show the most popular APIs:
cat gpa.txt.s | uniq -c | sort -rn | more
All the above commands could be combined into one single ‘caterpillar’, but using intermediate files is sometimes handy. It facilitates further searches later on… It also speeds things up.
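The grep/cut/sort/uniq steps above can also be sketched as one Python pass. This is only a sketch: it mirrors the four cut calls literally (fields split on ':', ',', '=' and ')'), so it inherits the same assumptions about the log line layout as the shell pipeline. The helper names are made up for illustration.

```python
from collections import Counter


def cut_field(text, delim, index):
    """Mimic `cut -d<delim> -f<index>`: 1-based field, whole text if no delimiter."""
    if delim not in text:
        return text
    parts = text.split(delim)
    return parts[index - 1] if index <= len(parts) else ""


def resolved_name(line):
    """Apply the same four cuts as the shell pipeline to one log line."""
    field = cut_field(line.rstrip("\r\n"), ":", 3)
    field = cut_field(field, ",", 2)
    field = cut_field(field, "=", 2)
    return cut_field(field, ")", 1)


def count_resolved_apis(lines):
    """Count names on lines mentioning API::GetProcAddress (case-insensitive)."""
    counts = Counter()
    for line in lines:
        if "api::getprocaddress" in line.lower():
            counts[resolved_name(line)] += 1
    return counts
```

Passing an open file handle to count_resolved_apis() and calling .most_common(20) on the result gives roughly what the sort and uniq steps produce, without the intermediate files.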
Coming back to our last query, we could look for all APIs that include ‘Reg’ as a prefix/infix/suffix. This gives us a rough idea of which Registry APIs are resolved most frequently:
cat gpa.txt.s | uniq -c | sort -rn | grep -E "Reg" | more
How would you interpret the results?
There are some FPs there, e.g. GetThemeBackgroundRegion, but it’s not a big deal. ANSI APIs (those with the ‘A’ at the end) are still more popular than the Unicode ones (more precisely, the ‘Wide’ ones, with the ‘W’ at the end). Or… the dataset we have at hand is biased towards older samples that were compiled without Unicode in mind. So… be careful: any interpretation is heavily biased by the dataset.
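To put rough numbers on the ANSI vs. Wide observation, something like the sketch below could be run over gpa.txt. It is a heuristic under stated assumptions: it only keeps names that start with ‘Reg’ (which drops FPs like GetThemeBackgroundRegion, but also drops legitimate names where ‘Reg’ is not at the start), and it treats a trailing ‘A’ or ‘W’ as the ANSI/Wide marker, which will occasionally be wrong. The function names are hypothetical.

```python
from collections import Counter


def load_names(path):
    """Read one API name per line (e.g. gpa.txt) into a Counter."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return Counter(line.strip() for line in handle if line.strip())


def ansi_wide_totals(counts, prefix="Reg"):
    """Sum counts of names starting with `prefix`, split by a trailing A or W."""
    totals = Counter()
    for name, total in counts.items():
        if not name.startswith(prefix):
            continue  # skips infix hits such as GetThemeBackgroundRegion
        if name.endswith("A"):
            totals["ansi"] += total
        elif name.endswith("W"):
            totals["wide"] += total
        else:
            totals["neither"] += total
    return totals
```

For example, ansi_wide_totals(load_names("gpa.txt")) returns three totals that can be compared directly; the same caveat about dataset bias applies to whatever numbers come out.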
But see? This is all an open book!
Again, I want to emphasize that all these searches can be done in many ways. It’s also possible you will find some flaws in my queries. It’s OK. This is data for playing around!
Now, imagine you want to see all the DELPHI APIs that we intercepted:
grep -iE Delphi:: Sandbox_250k_logs_Happy_New_Year_2018 | more
or, all inline functions from Visual C++:
grep -iE VC:: Sandbox_250k_logs_Happy_New_Year_2018 | more
or, all the rows with the ‘http://’ in it (highlighting possible URLs):
grep -iE http:// Sandbox_250k_logs_Happy_New_Year_2018 | more
You can also see what debug strings samples send:
grep -iE API::OutputDebugString Sandbox_250k_logs_Happy_New_Year_2018 | more
You can check what values are used by the Sleep functions:
grep -iE API::Sleep Sandbox_250k_logs_Happy_New_Year_2018 | more
and the windows searched for by anti-AV/anti-analysis tools:
grep -iE API::FindWindow Sandbox_250k_logs_Happy_New_Year_2018 | more
etc. etc.
The sky is the limit.
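Since each grep above rereads the whole dump, a single pass that tallies several markers at once may be quicker for a first overview. The sketch below reuses the exact search strings from the commands above and matches them case-insensitively, like grep -i; the function name and the idea of counting lines per marker are mine.

```python
from collections import Counter

# Search strings taken from the grep commands above.
MARKERS = (
    "Delphi::",
    "VC::",
    "http://",
    "API::OutputDebugString",
    "API::Sleep",
    "API::FindWindow",
)


def tally_markers(lines, markers=MARKERS):
    """Count, in one pass, how many lines contain each marker."""
    lowered = [(marker, marker.lower()) for marker in markers]
    counts = Counter()
    for line in lines:
        haystack = line.lower()
        for marker, needle in lowered:
            if needle in haystack:
                counts[marker] += 1
    return counts
```

Feeding it an open file handle returns one count per marker. To keep the matching lines rather than just count them, the Counter could be swapped for a dict of lists.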
You can look at the beginning of every single sample and identify the ‘dynamic’ flow of the WinMain procedure for many different compilers, discover various environment variables used by different samples, cluster APIs from specific libraries, observe techniques like process hollowing, and look at the distribution of WriteProcessMemory calls to understand how many samples use RunPE for code injection and execution and how many rely on position-independent code (PIC). You can see which startup points are used most frequently (it’s not always HKCU\…\Run!), what mechanisms are used to launch code in a foreign process (e.g. RtlCreateUserThread, APC functions), how many processes are suspended before code is injected into them (CREATE_SUSPENDED), etc. etc.
Again… this data can remain dead data, or you can bring it to life by being creative and mining it in any possible way…
If you have any questions, feel free to DM me on Twitter or ping me directly via email.
Note: commercial use of this data is prohibited. I only mention it because not only is it most likely tempting to abuse it, but you may actually be better off using a different data set. If you want to use it commercially, I could provide you with 1.6M Unicode-based reports for analysis with more details included. Get in touch to find out more 🙂