
What can you do with 250K sandbox reports?

January 20, 2018 in Batch Analysis, Clustering, Malware Analysis, Sandboxing

I was recently asked about the data I released for the New Year celebration. The question was: okay, what can I do with all this alleged goodness?


For starters, this is the first time (at least to my knowledge) someone dumped 250K reports of sandboxed samples. The reports are not perfect, but can help you to understand the execution flow for many malware (and in more general terms: software) samples.

What does it mean in practice?

Let’s have a look…

Say you want to see all the possible driver names that these 250k include.

Why would you need that?

This could tell you what anti-analysis tricks malware samples use, what device names are used to fingerprint the OS. Perhaps some of these devices are not even documented yet!

grep -iE "[\\][\\][.][\\]" Sandbox_250k_logs_Happy_New_Year_2018

gives you every log line that mentions a \\.\ device path.

You can play around with the output, but writing a perl/python script to extract these is probably a better idea.
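
As a starting point for such a script, here is a minimal Python sketch. It assumes that a device name is a run of letters, digits and a few punctuation characters right after the \\.\ prefix; that character class is a guess, not something the log format guarantees, and count_device_names is a made-up helper name.

```python
import re
from collections import Counter

# \\.\ followed by characters that commonly appear in device names.
# ASSUMPTION: this character class is a guess; widen it if names get cut short.
DEVICE_RE = re.compile(r"\\\\\.\\[A-Za-z0-9_$#{}\-]+")


def count_device_names(lines):
    """Count every \\\\.\\name occurrence, lower-cased to merge case variants."""
    counts = Counter()
    for line in lines:
        for match in DEVICE_RE.findall(line):
            counts[match.lower()] += 1
    return counts
```

Passing it the open log file, e.g. count_device_names(open("Sandbox_250k_logs_Happy_New_Year_2018", errors="replace")), and then calling .most_common(50) on the result should list the most frequently referenced names first.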

Okay, what about the most popular function resolved using the GetProcAddress API?

Something like this could help:

grep -iE API::GetProcAddress Sandbox_250k_logs_Happy_New_Year_2018 | 
cut -d: -f3 | cut -d, -f2 | cut -d= -f2 | cut -d")" -f1

This will give you a list of all the API names resolved this way.

We can save the result to a file by redirecting the output of that command to e.g. ‘gpa.txt’:

grep -iE API::GetProcAddress Sandbox_250k_logs_Happy_New_Year_2018 | 
cut -d: -f3 | cut -d, -f2 | cut -d= -f2 | cut -d")" -f1 > gpa.txt

This will take a while.

You can now sort it:

sort gpa.txt > gpa.txt.s

The resulting file gpa.txt.s can then be analyzed further: count how many times each API occurs, then sort the counts in descending order to show the most popular APIs:

cat gpa.txt.s | uniq -c | sort -rn | more

All the above commands could be combined into one single ‘caterpillar’, but using intermediate files is sometimes handy. It facilitates further searches later on… It also speeds things up.
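
If you would rather do the extraction and counting in a single pass, the cut chain can be approximated in Python. This is only a sketch: the regular expression is modelled on log lines such as API::GetProcAddress (mod=KERNEL32.dll, api=FlsAlloc)=00000000, other argument layouts (if the dump contains any) will simply be skipped, and count_resolved_apis is a hypothetical helper name.

```python
import re
from collections import Counter

# Captures the api= value of a GetProcAddress line.
# ASSUMPTION: every line of interest looks like "(mod=<module>, api=<name>)".
GPA_RE = re.compile(r"API::GetProcAddress \(mod=[^,]*, api=([^)]*)\)")


def count_resolved_apis(lines):
    """Return a Counter mapping each resolved API name to its number of occurrences."""
    counts = Counter()
    for line in lines:
        match = GPA_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Calling .most_common(25) on the returned Counter plays the role of the sort, uniq -c, sort chain, and the Counter can be reused for follow-up questions without reading the log again.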

Coming back to our last query, we could look for all APIs that include ‘Reg’ as a prefix/infix/suffix. This gives us a rough idea of which Registry APIs are resolved most frequently:

cat gpa.txt.s | uniq -c | sort -rn | grep -E "Reg" | more

How would you interpret the results?

There are some FPs there e.g. GetThemeBackgroundRegion, but it’s not a big deal. ANSI APIs (those with the ‘A’ at the end) are still more popular than the Unicode ones (more precisely, ‘Wide’ ones, with the ‘W’ at the end). Or… the dataset we have at hand is biased towards older samples that were compiled w/o Unicode in mind. So… be careful… any interpretation is easily biased.
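
To put a number on the ANSI vs. Wide observation, a rough tally can be sketched in Python. It assumes one API name per line (as in gpa.txt) and treats a trailing ‘A’ or ‘W’ as the only signal, which is a simplification: a name that merely happens to end in one of those letters is counted too. The helper name ansi_wide_split is made up for illustration.

```python
from collections import Counter


def ansi_wide_split(api_names, needle="Reg"):
    """Tally names containing `needle` by their last letter: 'A', 'W' or anything else."""
    tally = Counter()
    for name in api_names:
        name = name.strip()
        if needle not in name:
            continue  # same filter as the case-sensitive grep for "Reg"
        if name.endswith("A"):
            tally["ansi"] += 1
        elif name.endswith("W"):
            tally["wide"] += 1
        else:
            tally["other"] += 1
    return tally
```

Running ansi_wide_split(open("gpa.txt")) counts occurrences rather than distinct names, so a handful of very chatty samples can still skew the picture.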

But see? This is all an open book!

Again, I want to emphasize that all the searches can be done in many ways. It’s also possible you will find some flaws in my queries. It’s OK. This is data for playing around!

Now, imagine you want to see all the DELPHI APIs that we intercepted:

grep -iE Delphi:: Sandbox_250k_logs_Happy_New_Year_2018 | more

or, all inline functions from Visual C++:

grep -iE VC:: Sandbox_250k_logs_Happy_New_Year_2018 | more

or, all the rows with the ‘http://’ in it (highlighting possible URLs):

grep -iE http:// Sandbox_250k_logs_Happy_New_Year_2018 | more

You can also see what debug strings samples send:

grep -iE API::OutputDebugString Sandbox_250k_logs_Happy_New_Year_2018 | more

You can check what values are used by the Sleep functions:

grep -iE API::Sleep Sandbox_250k_logs_Happy_New_Year_2018 | more

and the windows that anti-AV/anti-analysis tricks search for:

grep -iE API::FindWindow Sandbox_250k_logs_Happy_New_Year_2018 | more

etc. etc.
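
Each of the greps above reads the whole file again, which takes a while with a log this size. A single pass that picks out several API families at once could look roughly like the Python sketch below. It assumes every call line starts with three bracketed columns followed by GROUP::Name, as in the log excerpts of the release post; select_calls and its default prefixes are illustrative choices only.

```python
import re

# Three bracketed columns, then the group prefix and the function name.
CALL_RE = re.compile(r"^(?:\[[^\]]*\]){3}(\w+)::(\w+)")


def select_calls(lines, name_prefixes=("OutputDebugString", "Sleep", "FindWindow")):
    """Yield (prefix, line) for API:: calls whose name starts with one of the prefixes."""
    wanted = [prefix.lower() for prefix in name_prefixes]
    for line in lines:
        match = CALL_RE.match(line)
        if not match or match.group(1).upper() != "API":
            continue
        name = match.group(2).lower()  # case-insensitive, like grep -i
        for prefix, low in zip(name_prefixes, wanted):
            if name.startswith(low):
                yield prefix, line.rstrip("\r\n")
                break
```

Because it is a generator, nothing is kept in memory; the caller decides whether to count the hits, print them, or write each family to its own file.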

The sky is the limit.

You can look at the beginning of every single sample and identify the ‘dynamic’ flow of the WinMain procedure for many different compilers, discover various environment variables used by different samples, cluster APIs from specific libraries, observe techniques like process hollowing, and look at the distribution of WriteProcessMemory calls to understand how many samples use RunPE for code injection and execution and how many rely on position-independent code (PIC). You can also see which startup points are used most frequently (it’s not always HKCU\…\Run!), what mechanisms are used to launch code in a foreign process (e.g. RtlCreateUserThread, APC functions), how many processes are suspended before code is injected into them (CREATE_SUSPENDED), etc. etc.
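
As one example of the first idea, the opening calls of each sample can be turned into a crude ‘startup signature’ and counted. The Python sketch below relies on the ### SAMPLE #<number> separator and the [PID][TID][ADDRESS]GROUP::Name line layout from the release post; the function name startup_signatures and the default depth of 5 are arbitrary choices, and grouping by exact call sequence is just one simple way to cluster.

```python
import re
from collections import Counter

SAMPLE_RE = re.compile(r"^### SAMPLE #(\d+)")
CALL_RE = re.compile(r"^(?:\[[^\]]*\]){3}(\w+::\w+)")


def startup_signatures(lines, depth=5):
    """Count tuples made of the first `depth` call names of every sample."""
    signatures = Counter()
    current = None  # call names collected for the sample being read
    for line in lines:
        if SAMPLE_RE.match(line):
            if current:  # samples without any logged call are ignored
                signatures[tuple(current)] += 1
            current = []
            continue
        if current is None or len(current) >= depth:
            continue  # still in the file header, or signature already complete
        match = CALL_RE.match(line)
        if match:
            current.append(match.group(1))
    if current:
        signatures[tuple(current)] += 1
    return signatures
```

Calling startup_signatures(log, depth=5).most_common(20) should surface the most common opening sequences, which are likely to be dominated by compiler runtime start-up code.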

Again… this data can remain dead data, or you can bring it to life by being creative and mining it in any possible way…

If you have any questions, feel free to DM me on Twitter, or ping me directly via email.

Note: commercial use of this data is prohibited; I only mention it because not only is it most likely tempting to abuse it, but you may actually be better off using a different data set. If you want to use it commercially I could provide you with 1.6M Unicode-based reports for analysis with more details included. Get in touch to find out more 🙂

Happy New Year 2018 & Get yourself logs from 250K sandboxed samples

December 31, 2017 in Batch Analysis, Clustering, Malware Analysis, Sandboxing

Update 2

Please use this link:!LItwzAAL!NqcMVEnIqd17x5guL0V55gwjy8Q3xQMuSyeP-DelbRE


Turns out I had a bug in my script and in the first go I exported less than 250K sessions (228K only), so I had to fix the dump and re-upload it. If you downloaded it previously, sorry, you will need to do it one more time 🙂

Thanks to @hrbrmstr for spotting and reporting the issue!

Old post

Happy New Year 2018!

Unless you are one of the companies or organizations doing commercial sample analysis and sandboxing, it is almost impossible to get access to normalized data logs from sandboxing sessions. If you want to do analysis you need to either scrape data from the web, or run your own sandbox. In order to fill the gap I decided to release logs from 250,000 sandbox sessions.

  • The file contains logs from 250K sandboxed sessions (250K unique samples).
  • 32-bit PEs only. All executed offline (no access to network).
  • Sometimes it may not be 100% accurate – I ran various sessions, with various settings/timeouts.
  • You’ll find traces of Windows API, NT API, VC and Delphi inline functions, COM, Visual Basic, string functions, Nullsoft APIs, Anti-VM tricks, etc. – and various stuff I discussed or will discuss in the Enter Sandbox series.

Have a look, run some analysis, crunch data – share results.


File sizes (sha1 hashes):

34,515,244,109 Sandbox_250k_logs_Happy_New_Year_2018
   993,911,182 Sandbox_250k_logs_Happy_New_Year_2018.7z

Note: This data cannot be used for commercial purposes.

If you like this release, you may also want to re-visit my older data dumps:

File format:

  • The file starts with a short header (easy to spot)
  • Then it’s followed by the ### SAMPLE #<number>
  • Then the actual logs start.
    • The lines start with [PID][TID][ADDRESS]
    • The API groups are prefixed with group prefixes i.e. API::, DELPHI::, VC:: (the latter are referring to inline functions)
    • The parameters are NOT named / structured according to the Windows API docs; this is because the log is focused on extracting the most useful information, and avoiding cluttering the log with useless/unused function arguments (but then even this is only partially true, because this tool grew organically over the years and was not an orchestrated effort to make OCDs happy 😉 – if I were to write it again, obviously it would be perfect 😉)


### SAMPLE #00000001
[1980][252][00422c77]API::GetSystemTimeAsFileTime (lpSystemTimeAsFileTime=0012FFB0)
[1980][252][0041c305]API::GetModuleHandleW (lpModuleName=kernel32.dll)=7C800000
[1980][252][0041c315]API::GetProcAddress (mod=KERNEL32.dll, api=FlsAlloc)=00000000
[1980][252][0041c328]API::GetProcAddress (mod=KERNEL32.dll, api=FlsFree)=00000000
[1980][252][0041c33b]API::GetProcAddress (mod=KERNEL32.dll, api=FlsGetValue)=00000000
[1980][252][0041c34e]API::GetProcAddress (mod=KERNEL32.dll, api=FlsSetValue)=00000000
[1980][252][0041c361]API::GetProcAddress (mod=KERNEL32.dll, api=InitializeCriticalSectionEx)=00000000
[1980][252][0041c374]API::GetProcAddress (mod=KERNEL32.dll, api=CreateSemaphoreExW)=00000000
[1980][252][0041c387]API::GetProcAddress (mod=KERNEL32.dll, api=SetThreadStackGuarantee)=00000000
[1980][252][0041c39a]API::GetProcAddress (mod=KERNEL32.dll, api=CreateThreadpoolTimer)=00000000
[1980][252][0041c3ad]API::GetProcAddress (mod=KERNEL32.dll, api=SetThreadpoolTimer)=00000000
[1980][252][0041c3c0]API::GetProcAddress (mod=KERNEL32.dll, api=WaitForThreadpoolTimerCallbacks)=00000000
[1980][252][0041c3d3]API::GetProcAddress (mod=KERNEL32.dll, api=CloseThreadpoolTimer)=00000000
[1980][252][0041c3e6]API::GetProcAddress (mod=KERNEL32.dll, api=CreateThreadpoolWait)=00000000
[1980][252][0041c3f9]API::GetProcAddress (mod=KERNEL32.dll, api=SetThreadpoolWait)=00000000
[1980][252][0041c40c]API::GetProcAddress (mod=KERNEL32.dll, api=CloseThreadpoolWait)=00000000
[1980][252][0041c41f]API::GetProcAddress (mod=KERNEL32.dll, api=FlushProcessWriteBuffers)=00000000
[1980][252][0041c432]API::GetProcAddress (mod=KERNEL32.dll, api=FreeLibraryWhenCallbackReturns)=00000000
[1980][252][0041c445]API::GetProcAddress (mod=KERNEL32.dll, api=GetCurrentProcessorNumber)=00000000
[1980][252][0041c458]API::GetProcAddress (mod=KERNEL32.dll, api=GetLogicalProcessorInformation)=7C861E6F
[1980][252][0041c46b]API::GetProcAddress (mod=KERNEL32.dll, api=CreateSymbolicLinkW)=00000000
[1980][252][0041c47e]API::GetProcAddress (mod=KERNEL32.dll, api=SetDefaultDllDirectories)=00000000
[1980][252][0041c491]API::GetProcAddress (mod=KERNEL32.dll, api=EnumSystemLocalesEx)=00000000
[1980][252][0041c4a4]API::GetProcAddress (mod=KERNEL32.dll, api=CompareStringEx)=00000000
[1980][252][0041c4b7]API::GetProcAddress (mod=KERNEL32.dll, api=GetDateFormatEx)=00000000
[1980][252][0041c4ca]API::GetProcAddress (mod=KERNEL32.dll, api=GetLocaleInfoEx)=00000000
[1980][252][0041c4dd]API::GetProcAddress (mod=KERNEL32.dll, api=GetTimeFormatEx)=00000000
[1980][252][0041c4f0]API::GetProcAddress (mod=KERNEL32.dll, api=GetUserDefaultLocaleName)=00000000
[1980][252][0041c503]API::GetProcAddress (mod=KERNEL32.dll, api=IsValidLocaleName)=00000000
[1980][252][0041c516]API::GetProcAddress (mod=KERNEL32.dll, api=LCMapStringEx)=00000000
[1980][252][0041c529]API::GetProcAddress (mod=KERNEL32.dll, api=GetCurrentPackageId)=00000000
[1980][252][0041ab40]API::GetCommandLineW = "_0000034AD55817135B1B1C4AE97CD449.exe"
[1980][252][004228f9]API::GetModuleFileNameW (mod=00000000, namebuf=%SYSTEM%\_0000034AD55817135B1B1C4AE97CD449.exe, buflen=260)
[1980][252][00424495]API::MultiByteToWideChar (CodePage=000004E4,dwFlags=MB_PRECOMPOSED [00000001, 1],lpMultiByteStr= 
[1980][252][0042450c]API::MultiByteToWideChar (CodePage=000004E4,dwFlags=MB_PRECOMPOSED [00000001, 1],lpMultiByteStr= 
[1980][252][0041be2d]API::MultiByteToWideChar (CodePage=000004E4,dwFlags=MB_PRECOMPOSED [00000001, 1],lpMultiByteStr= 
[1980][252][0041bea1]API::MultiByteToWideChar (CodePage=000004E4,dwFlags=MB_PRECOMPOSED [00000001, 1],lpMultiByteStr= 
[1980][252][0041bf83]API::WideCharToMultiByte (cp= [000004E4, 1252],fl= [00000000, 0],wide= 
[1980][252][0041be2d]API::MultiByteToWideChar (CodePage=000004E4,dwFlags=MB_PRECOMPOSED [00000001, 1],lpMultiByteStr= 
[1980][252][0041bea1]API::MultiByteToWideChar (CodePage=000004E4,dwFlags=MB_PRECOMPOSED [00000001, 1],lpMultiByteStr= 
[1980][252][0041bf83]API::WideCharToMultiByte (cp= [000004E4, 1252],fl= [00000000, 0],wide= 
[1980][252][0041c543]API::SetUnhandledExceptionFilter (0042247D)
[1980][252][00401dde]VC::vc_strlen1 (lpString=\/)
[1980][252][0040df7c]API::GetTempPathA (namebuf=C:\DOCUME~1\USERNAME\LOCALS~1\Temp\, buflen=260)
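
The layout described above (a ### SAMPLE #<number> separator followed by [PID][TID][ADDRESS]GROUP::Name lines) can be turned into structured records with a few lines of Python. This is a sketch under assumptions: the Call tuple and parse_log are made-up names, the PID and TID columns are kept as strings because their number base is not documented, and everything after the function name is left unparsed since the parameters are not structured consistently.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional


class Call(NamedTuple):
    sample: Optional[str]  # digits from the "### SAMPLE #" line, None before the first one
    pid: str
    tid: str
    address: int           # parsed as hexadecimal
    group: str             # e.g. API, VC
    name: str
    rest: str              # unparsed arguments / return value


SAMPLE_RE = re.compile(r"^### SAMPLE #(\d+)")
LINE_RE = re.compile(r"^\[([^\]]+)\]\[([^\]]+)\]\[([0-9A-Fa-f]+)\](\w+)::(\w+)\s*(.*)$")


def parse_log(lines: Iterable[str]) -> Iterator[Call]:
    """Yield one Call per recognised log line; header and unknown lines are skipped."""
    sample = None
    for line in lines:
        line = line.rstrip("\r\n")
        match = SAMPLE_RE.match(line)
        if match:
            sample = match.group(1)
            continue
        match = LINE_RE.match(line)
        if match:
            pid, tid, address, group, name, rest = match.groups()
            yield Call(sample, pid, tid, int(address, 16), group, name, rest)
```

Since parse_log is a generator, it can stream over the full 34 GB file, for example to build a Counter of (group, name) pairs per sample.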