The Hades haz you. Phantom (유령) – The DFIR drama from South Korea

May 18, 2013 in Others

The way movies portray hacking, forensics, security research and coding is obviously metaphoric and usually made as visually rich as possible to ensure the audience ‘gets it’ and, as a bonus, can see how cool the process is. Anyone who has spent a few sleepless nights with Olly and IDA Pro, worked around the clock on forensic cases, reviewed vulnerability reports or source code, or worked in their head on a particular algorithm for a few weeks before actually sitting down and writing the code knows that the reality is a bit more boring :)

If you ask a random security pro what ‘the best’ hacking movies are, they will surely laugh and point out at least a few from the following list:

..and perhaps at some stage they will suddenly become a bit more serious and mention that ‘but Matrix did show NMAP in action’.

Luckily, there are actually movies out there that beat all the above mentioned productions in terms of technical accuracy, and show a relatively realistic representation of IT security work.

This post is about one of them.

A while ago I happened to stumble upon a Korean TV drama called “Phantom” (also known as “Ghost”) that made my jaw drop. The drama was produced by the Korean network SBS.

The plot of the drama is simple – The Hades haz you :)

hades

Copyright notice: The picture of Hades logo was taken from the clip on Youtube. The copyright belongs to SBS.

Okay, the plot is a bit more complicated – it’s “Face off” meets “Jason Bourne” meets CSI.

Or

Evil Hackers from Korea and Hong Kong vs. Forensic guys from Korean Police.

Since it’s not IMDB, just a short note on the movie – I have already described bits of the plot; I don’t want to spoil it so I won’t add more information here. The music is all right. The acting is so-so (the lead characters are a little bit too stiff and rarely smile). There are gaps in the story as well, but it’s a TV drama after all, and it’s Korean so there is lots of melodrama ‘by default’. There is also very strong product placement, but if this is the only way to get funds to make TV dramas then so be it.

Okay, back to ‘technical’ stuff.

What makes this particular TV drama stand out is the attention to detail. While they didn’t completely escape the typical Hollywood clichés (computers with the evidence are thrown out of the window, logic bombs with a progress bar, etc.) the makers really did their homework and put quite an effort into demonstrating how typical hacking works, and how forensic guys investigate it.

Lots of scenes are set in the forensic lab, or at the crime scene – in internet coffee shops, data centers, etc. We also witness the actual data acquisition, evidence analysis (HDD, mobile, CCTV footage, video manipulation analysis, social media, Event Logs) and most importantly – lots of popular DFIR/RCE software is used to ‘understand’ the data and code. This is really not just a single random tool or a hand-made HTML page that is supposed to look like ‘analysis results’. Quite the opposite – many of the most common tools from the DFIR/RCE/pentesting arsenal somehow found their way into the drama.

The software I remember seeing includes:

  • EnCase
  • WinHex
  • Metasploit
  • OllyDbg
  • DCode
  • SecureCRT
  • Wireshark
  • XRY
  • BackTrack
  • Process Explorer

and lots more (I wish I took notes!).

Last, but not least – there are also realistic attacks being used as a part of the plot including, but not limited to:

  • 0Day exploits (using documents from Hangul Word Processor)
  • malware infections
  • billboard hacking
  • spoofed emails
  • identity theft
  • SCADA attacks
  • car hacking
  • hacking back in real time
  • DDoS attacks
  • Wi-Fi hacking
  • social engineering

and lo and behold – even STUXNET is mentioned!

Thumbs up South Korea!!!

UVWATAUAVAWH – Meet The Pushy String

May 16, 2013 in Batch Analysis, Malware Analysis, Silly

The title of this post is not a secret message and I am not intoxicated.

UVWATAUAVAWH happens to be the most popular string extracted from all .exe, .dll and .sys OS files on my 64-bit Windows. The string is so popular and at the same time suspicious that if you google it you will find people theorizing about it having something to do with BSODs / being a part of some internal ZeroAccess secret language.

If you convert the characters into hex:

UVWATAUAVAWH

you will get a string of bytes like these:

55 56 57 41 54 41 55 41 56 41 57 48

and these can also be represented as opcodes:

U  - push    rbp
V  - push    rsi
W  - push    rdi
AT - push    r12
AU - push    r13
AV - push    r14
AW - push    r15
H  - part of sub rsp, xxx opcode

The sequence is a very typical prologue for functions (64-bit code) – so typical that it is all over the place together with its variants (see below); the ‘vowelized’ properties of these strings remind me of an interesting paper about shellcodes that look like English text.

UVWATAUAVAWH
WATAUH
WATAUAVAWH
SUVWATAUAVAWH
SUVWATH
VWATAUAVH
SUVWATAUH
ATAUAVH
USVWATAUAVAWH
UVWATAUH
SUVWATAUAVH
SVWATAUAVAWH
USVWATH
USVWATAUH
USVWATAUAVH
VWATAUAVAWH
WAVAWH
ATAUAVAWH
VWATAUAWH
WATAVH
UVWATAUAVH
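The mapping from letters to opcodes described above can be sketched in a few lines of Python. This is only an illustration: the `decode_prologue` name is made up, the `S` entry (0x53, `push rbx`) is not listed in the table above but appears in the variants, and the trailing `H` (0x48) is reported as the REX.W prefix that typically starts `sub rsp, xxx`.

```python
# Single-byte push opcodes that happen to be printable ASCII letters.
SINGLE = {"S": "push rbx", "U": "push rbp", "V": "push rsi", "W": "push rdi"}
# 'A' (0x41) is the REX.B prefix; followed by 0x54..0x57 it pushes r12..r15.
REX_B = {"T": "push r12", "U": "push r13", "V": "push r14", "W": "push r15"}


def decode_prologue(s):
    """Translate a 'pushy string' into the x64 instructions it encodes."""
    out = []
    i = 0
    while i < len(s):
        ch = s[i]
        if ch == "A" and i + 1 < len(s) and s[i + 1] in REX_B:
            out.append(REX_B[s[i + 1]])
            i += 2
        elif ch in SINGLE:
            out.append(SINGLE[ch])
            i += 1
        elif ch == "H" and i == len(s) - 1:
            # 0x48 alone is only a prefix; the rest of the instruction
            # (e.g. sub rsp, xxx) is not part of the printable string.
            out.append("REX.W prefix")
            i += 1
        else:
            raise ValueError(f"unexpected byte {ord(ch):#04x} at offset {i}")
    return out


if __name__ == "__main__":
    text = "UVWATAUAVAWH"
    print(" ".join(f"{b:02X}" for b in text.encode("ascii")))
    for line in decode_prologue(text):
        print(line)
```

Any string from the list above can be passed to `decode_prologue`; anything that is not a run of these push opcodes followed by a final `H` raises `ValueError`.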

 

…and the most popular day for malware compilation is:

May 16, 2013 in Batch Analysis, Malware Analysis

Saturday.

Thursdays, Fridays and Saturdays are the days when malware is compiled most often.

It kinda makes sense*.

Who would like to work Sundays and Mondays?**

days_writing_malware

*remember what they say about statistics :) (data based on 2.5M samples)
**obviously, the APT guys
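For readers who want to repeat the experiment, here is a minimal sketch. It assumes the compilation date is taken from the TimeDateStamp field of the PE/COFF header (the post does not say which field was used), interprets it as UTC, and does nothing about forged or zeroed timestamps. All function names are made up for the example.

```python
import struct
from collections import Counter
from datetime import datetime, timezone

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def pe_timestamp(data):
    """Return the COFF TimeDateStamp of a PE image, or None if not a PE."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if e_lfanew + 12 > len(data) or data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        return None
    # Signature (4) + Machine (2) + NumberOfSections (2) precede the field.
    (stamp,) = struct.unpack_from("<I", data, e_lfanew + 8)
    return stamp


def compile_weekday(stamp):
    """Map a Unix timestamp to a weekday name (UTC is an assumption)."""
    return DAYS[datetime.fromtimestamp(stamp, tz=timezone.utc).weekday()]


def weekday_histogram(images):
    """Count compile weekdays over an iterable of raw file contents."""
    counts = Counter()
    for data in images:
        stamp = pe_timestamp(data)
        if stamp is not None:
            counts[compile_weekday(stamp)] += 1
    return counts
```

Feeding `weekday_histogram` the raw bytes of each sample gives a `Counter` keyed by day name, which can be plotted the same way as the chart above.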

…and the most popular Windows account for compiling malware is:

May 8, 2013 in Batch Analysis, Malware Analysis

Administrator.

Many malware samples contain debug strings that include paths, often pointing directly to the location where the source code is stored, and that location often happens to be under the USERPROFILE. For the fun of it, I extracted the strings from a large batch of samples and came up with the following statistics (showing top 50):

   3893 Administrator
   2963 JUANJO
   1121 ryanch
    928 Boy
    617 UserXP
    612 user
    519 1337
    502 User
    465 Admin
    435 root
    422 bld4act
    418 Owner
    347 nosferatus
    305 Administrateur
    300 M4x
    296 ismael
    277 goga
    277 Kyle
    255 Mirko
    247 1134
    244 kdglkrkjdfhslej
    241 FEDERIKO
    234 t0fx
    231 rstephens
    219 DarkCoderSc
    218 gcc
    205 icyheart
    200 Dave
    197 michael
    197 Roshan
    197 James
    195 Ben
    182 John
    178 admin
    173 Dev
    161 box1
    157 nonadmin
    153 FELIPE
    152 Familie
    151 Timothy
    137 Dhivin
    133 Vortex
    131 Robert
    130 dabdoub
    129 USER
    127 dr zinou
    125 packar
    122 David
    116 nathu
    116 Daniel

It’s obviously biased.

Other interesting names include:

  • tom age five
  • GANGSTA
  • Krusty the Clown
  • ^_^
  • ItchyFingerz
  • irishboy
  • romantic
  • lol
  • brad pitt
  • Love Bebe
  • LorD^^$$steal3R
  • Cyber-Warrior Ender
  • auchan
  • F-B-I
  • Valued Sony Customer
  • SexyReplay
  • Microsoft
  • Poo
  • Trojan
  • P@wn3d
  • Emperor Zhou Tai Nu

There are over 7000 account names on the list. If you want the full list, please contact me offline.
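The extraction itself is simple enough to sketch. The snippet below is an illustration rather than the script used for the statistics above: it assumes the debug strings have already been pulled out of the samples, and it only recognises profile paths under `Users` or `Documents and Settings`. The regular expression and the `account_names` helper are made up for the example.

```python
import re
from collections import Counter

# Matches e.g. C:\Users\<name>\... or C:\Documents and Settings\<name>\...
PROFILE_RE = re.compile(
    r"[A-Za-z]:\\(?:Users|Documents and Settings)\\([^\\]+)\\",
    re.IGNORECASE,
)


def account_names(strings):
    """Count account names found in profile paths inside debug strings."""
    counts = Counter()
    for s in strings:
        m = PROFILE_RE.search(s)
        if m:
            counts[m.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical debug strings, not taken from real samples.
    sample = [
        r"C:\Users\Administrator\Desktop\bot\Release\bot.pdb",
        r"c:\Documents and Settings\UserXP\My Documents\stub\stub.pdb",
        r"D:\build\loader\loader.pdb",
    ]
    for name, count in account_names(sample).most_common():
        print(f"{count:7d} {name}")
```

Names are counted case-sensitively, which matches the table above where `user`, `User` and `USER` appear as separate entries.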

JumpLists file names and AppID calculator

April 30, 2013 in Forensic Analysis, Software Releases

JumpList files are an interesting forensic artifact and as such they have been thoroughly explored by many researchers over the last 2-3 years. There is really a lot of material out there and there are also many tools that parse JumpList files’ structure quite well. This is why in this post I will focus not on the content of JumpList files, but on their… file names.

Algorithm

The JumpList file names are created using hash-like values that in turn are based on something that is called AppID. The Forensics Wiki lists many known Jump List file names based on AppIDs; examples include:

  • 918e0ecb43d17e23 used by Notepad (32-bit)
  • 9b9cdc69c1c24e2b used by Notepad (64-bit)
  • 1bc392b8e104a00e used by Remote Desktop

and so on and so forth. The data from Forensics Wiki has been harvested from many sources and it’s a very useful reference for further research.

The algorithm to create a hash-like value is actually ‘sort of known’. There are posts out there suggesting that the AppID is nothing but a CRC64 sum taken from the application path. For example, in this post, an Anonymous poster provided a Hex-Rays Decompiler code snapshot taken from shell32.dll showing how the AppID is generated. When I came across this particular comment I decided to verify it. I applied a CRC64 sum to an example path and compared it with an expected known file name, and since you are reading this post you are probably guessing that it failed miserably :)

Okay, so since it failed and since the algorithm didn’t seem to have been explored in depth yet, I thought I would give it a go. It turned out to be quite simple, but there were a few challenges on the way that may be interesting to know about, so I describe them below. I also ended up writing a perl script that I called AppID calculator (appid_calc.pl). It allows you to calculate an AppID based on a provided string – more about it below as well. You can find a download link to the script at the bottom of this post.

Challenges

Using the code snippet I referred to earlier as guidance, I quickly found the code responsible for generating AppIDs, put the appropriate breakpoints in a debugger, and.. immediately understood why the CRC64 (path) didn’t work for me earlier :)

The CRC64 algorithm has been indeed applied to a path, but there are a few quirks:

  • The path is first converted to Unicode
  • If the path is located in one of the locations that are recognized and treated by the system in a special way, the path is normalized first
  • The CRC64(Path) algorithm applies only to AppIDs automatically generated by the system; at any point in time any application can change its AppID using the SetCurrentProcessExplicitAppUserModelID API, or can even apply a window-specific AppID using IPropertyStore::SetValue to change the PKEY_AppUserModel_ID property of the particular window
  • On top of that, the CRC64 uses a non-standard polynomial

First, let’s talk about the CRC64. There are many CRC algorithms out there. In fact, they differ not only in their length in bits (CRC16, CRC32, CRC64), but also in the configuration of a particular implementation. There are obviously many standard configurations (Wikipedia describes quite a few), but the one used in AppID generation is not on the standard list. I know, because the very first thing I tried was to use all standard configurations, but all of them failed :-) .

The actual code used by the system relies on a precalculated lookup table, but googling around for the numbers from the table only brought 2-3 hits. In such a case, the usual way of solving the issue is to rip the code from the source and reimplement it e.g. in perl. This could be done easily. The 2-3 hits I mentioned earlier refer to code that was created as a result of reverse engineering of the thumbcache.dll file – it turns out that the very same CRC64 configuration/implementation has been used in that DLL.

Exploring the properties of CRC I eventually managed to deduce the CRC configuration and the actual polynomial used to generate the lookup table.

The polynomial used by the AppID algorithm is 0x92C64265D32139A4.

Once I found out I went to google again and this time I also got 2-3 hits only. The first two were the Thumb Cache-related code I already mentioned. The last one was the Microsoft page describing the use of this particular polynomial in an ADSStreamHeader structure:

Crc (8 bytes): A bit-reversed CRC-64 hash of the FCIADS stream from the TimeStamp field to the end of the structure that can be used to validate the integrity of the FCIADS stream. The cyclic redundancy check (CRC) polynomial is x**64 + x**61 + x**58 + x**56 + x**55 + x**52 + x**51 + x**50 + x**47 + x**42 + x**39 + x**38 + x**35 + x**33 + x**32 + x**31 + x**29 + x**26 + x**25 + x**22 + x**17 + x**14 + x**13 + x**9 + x**8 + x**6 + x**3 + 1, with the leading 1 implied. The normal representation is 0x92C64265D32139A4.

That was a good sign and I could now start implementing the appid calculator w/o ripping the lookup tables.
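To make the above concrete, here is a small Python sketch (not the perl from appid_calc.pl). The first part checks that the exponents quoted by Microsoft give 0x92C64265D32139A4 when the term x**k is stored in bit 63-k, i.e. in the bit-reversed form that a right-shifting CRC uses. The second part is a generic table-driven reflected CRC-64. The `init` and `xorout` defaults are assumptions, because this post does not state them, so the sketch is not guaranteed to reproduce real AppIDs as written.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
APPID_POLY = 0x92C64265D32139A4  # value quoted in the post

# Exponents from the Microsoft description (the x**64 term is implied).
EXPONENTS = [61, 58, 56, 55, 52, 51, 50, 47, 42, 39, 38, 35, 33, 32, 31,
             29, 26, 25, 22, 17, 14, 13, 9, 8, 6, 3, 0]


def reflected_constant(exponents):
    """Bit-reversed form of a 64-bit polynomial: x**k lives in bit 63 - k."""
    value = 0
    for k in exponents:
        value |= 1 << (63 - k)
    return value


def make_table(poly):
    """Build the 256-entry lookup table for a reflected (LSB-first) CRC."""
    table = []
    for i in range(256):
        r = i
        for _ in range(8):
            r = (r >> 1) ^ poly if r & 1 else r >> 1
        table.append(r)
    return table


def crc64(data, poly, init=MASK64, xorout=0):
    """Reflected CRC-64 of `data`; init/xorout defaults are ASSUMPTIONS."""
    table = make_table(poly)
    crc = init
    for byte in data:
        crc = table[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return (crc ^ xorout) & MASK64


if __name__ == "__main__":
    print(f"{reflected_constant(EXPONENTS):016X}")
    # The path is converted to Unicode first (UTF-16LE assumed here).
    path = "c:\\windows\\notepad.exe".encode("utf-16-le")
    print(f"{crc64(path, APPID_POLY):016x}")
```

Generating the table from the polynomial is what makes it possible to avoid ripping the lookup table from the DLL; the remaining parameters would have to be confirmed against known file names.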

The second issue to solve was the normalization. The paths are normalized using KNOWNFOLDERIDs, so it’s a simple search and replace before applying the CRC.

One aspect of normalization I need to mention is… ambiguity. Depending on the OS (32 vs. 64 bit) different KNOWNFOLDERIDs are applied during the normalization of the path and it’s quite confusing. I suggest reading the Microsoft page I linked to above for further details.
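The search-and-replace step can be illustrated as follows. This is a simplified sketch under several assumptions: only two folders are handled, the drive letter is hard-coded, the `wow64` switch is a made-up way of expressing the 32-/64-bit ambiguity, and the real normalization rules may differ in case handling and coverage. The GUIDs are FOLDERID_System and FOLDERID_SystemX86 from Microsoft’s KNOWNFOLDERID list.

```python
FOLDERID_SYSTEM = "{1AC14E77-02E7-4E5D-B744-2EB1AE5198B7}"
FOLDERID_SYSTEMX86 = "{D65231B0-B2F1-4857-A4CE-A8E7C6EA7D27}"


def normalize(path, wow64=False):
    """Replace a known folder prefix with its KNOWNFOLDERID (sketch only).

    wow64=True models a 32-bit process on a 64-bit OS, where system32 is
    redirected and therefore maps to FOLDERID_SystemX86 (an assumption
    used here to illustrate the ambiguity).
    """
    prefixes = {
        "c:\\windows\\system32\\": FOLDERID_SYSTEMX86 if wow64 else FOLDERID_SYSTEM,
        "c:\\windows\\syswow64\\": FOLDERID_SYSTEMX86,
    }
    lowered = path.lower()
    for prefix, folder_id in prefixes.items():
        if lowered.startswith(prefix):
            # Keep the remainder of the path exactly as it was given.
            return folder_id + "\\" + path[len(prefix):]
    return path


if __name__ == "__main__":
    print(normalize("c:\\windows\\system32\\notepad.exe"))
    print(normalize("c:\\windows\\system32\\notepad.exe", wow64=True))
```

The same input path yields two different normalized strings depending on the context, which is why one path can end up with more than one AppID.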

Last, but not least – quite a lot of applications use the SetCurrentProcessExplicitAppUserModelID API to change their AppID after they are executed. For example, the following applications do it (AppID – application name):

  • Microsoft.Silverlight.Offline – Silverlight
  • Microsoft.InternetExplorer.Default – Internet Explorer
  • VMware.Workstation.vmplayer – VMware Player
  • Microsoft.Windows.MediaPlayer32 – Windows Media Player (32-bit)
  • Microsoft.Windows.MediaPlayer64 – Windows Media Player (64-bit)

For this reason, attempting to find e.g. AppID of c:\program files\Internet Explorer\iexplore.exe doesn’t really make sense as all IE windows are grouped under Microsoft.InternetExplorer.Default AppID.

Examples

AppIDs of Internet Explorer and Sticky Notes

appid_1

These can be confirmed by looking at the Forensics Wiki:

  • Microsoft.InternetExplorer.Default – 28C8B86DEAB549A1

appid_2

  • Microsoft.Windows.StickyNotes – 337ED59AF273C758

appid_3

 Notepad

appid_4

You may notice that in this example there are 2 different AppIDs shown. This is because of the ambiguity I mentioned earlier; applications running on 64-bit systems can be executed in more than one configuration and since there is WOW64 folder redirection happening AppID needs to be calculated in a context.

The Notepad path looks the same to both 32- and 64-bit applications (because of WOW64 folder redirection):

  • c:\windows\system32\notepad.exe

but the AppID depends on the type of the Notepad .exe file:

  • if it is 32-bit, the AppID is 918E0ECB43D17E23
  • if 64-bit, the AppID is 9B9CDC69C1C24E2B.

This can also be confirmed via the Forensics Wiki:

appid_6

Internet Explorer – via path

It gets even more complicated with the Program Files folder as it has two versions – with and without (X86) – and 32-/64-bit applications both ‘see’ Program Files the same way. As an example we could try to generate a hash for Internet Explorer in various configurations by running the appid calculator and providing it with the path c:\Program Files\Internet Explorer\iexplore.exe. As mentioned earlier IE uses an AppID that it sets up during the launch, so you should never see the AppIDs shown on the screenshot below, but it is a simple example to show various configurations of the Program Files folder using a well-known path.

appid_5

Again, I strongly suggest reading the Microsoft article about KNOWNFOLDERIDs. The appid calculator provides a link to it as well if the path is known to be ambiguous (system32, program files, program files\common).

Download

You can find the script here. This is a first version, coded in a hurry so it may contain bugs. If you find any issues, please let me know. Thanks!

To run:

perl appid_calc.pl

If no argument is passed to it, it will calculate a few sample AppIDs – the examples illustrate various ways one can provide the path to the script:

  • c:\windows\notepad.exe
  • c:\windows\system32\notepad.exe
  • c:\windows\syswow64\notepad.exe
  • {1AC14E77-02E7-4E5D-B744-2EB1AE5198B7}\notepad.exe
  • c:\program files\Internet Explorer\iexplore.exe
  • MICROSOFT.INTERNETEXPLORER.DEFAULT

Java cache file names

April 19, 2013 in Forensic Analysis, Software Releases

I was wondering how Java generates the file names for its temporary cache files and after googling around, I found the answer in the Java source code – the function responsible is called generateCacheFileName and its implementation has changed over time; here is how it is done in JDK 5 and 6/7:

JDK 5.xx

Files are saved in the following location:

  • %USERPROFILE%\Application Data\Sun\Java\Deployment\
    cache\javapi\v1.0\[cachefilename]

The procedure for generating [cachefilename] is described here:

JDK 6.xx-7.xx

Files are saved in the following location:

  • %USERPROFILE%\Local Settings\Application Data\Sun\Java\Deployment\
    cache\6.0\[cachebucket]\[cachefilename]

The procedure for generating [cachebucket]\[cachefilename] is described here:

The code

I ripped the code from these sources and created a simple java snippet that helps to test cache file name for a given URL. At the moment it has a small bug, but I hope you won’t notice it :)

Example – JRE 1.5

I googled around and found an old applet that worked under JRE 1.5, then visited the page so that the cached files could be created; the URL passed to the cachename Java program produces exactly the same result:

javacache_1

Example – JRE 1.6-1.7

I simply visited the Oracle web page that detects the browser and let the applet load:

javacache_2

Download

You can download the code here.

To compile, run:

javac cachename.java

To execute, run:

java cachename url

 

RegRipper Ripper (3R) and the list of reg keys covered by RR plugins

April 4, 2013 in 3RPG, Forensic Analysis

update

Updated 3R to cover the latest archive from the RegRipper site – plugins20130403.zip (new version introduced over 40 new scripts)

old post

I got curious what keys are already covered by existing 280+ RegRipper Plugins so I wrote a quick and dirty script to retrieve the data from all plugins in an automated way. For the fun of it, I named the script RegRipper Ripper (3R).

The script is here, and the result of running it over the latest bundle is available here.

You may use the list to see what’s already covered and… avoid writing a plugin for a key that is already handled.

The 3R is a dumb script, so a few things I had to fix manually (but still inside the script, so it can be used to regenerate the tables anytime needed, e.g. after the bundle update). I hope there are no mistakes, but if you spot any, please let me know and I will fix that. Thanks!
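To give an idea of what ‘retrieving the data from all plugins’ can look like, here is a rough Python sketch. It is not the 3R script: it assumes plugins keep the key in a `$key_path` scalar assigned from a quoted string literal (a common convention in RegRipper plugins), and it ignores every other way a plugin may build its paths, which is exactly the kind of case that needs manual fixing. The function names are made up.

```python
import re
from pathlib import Path

# my $key_path = "Microsoft\\Windows\\CurrentVersion\\Run";
KEY_RE = re.compile(r"""\$key_path\s*=\s*(["'])(.+?)\1\s*;""")


def keys_in_plugin(source):
    """Return registry key paths assigned to $key_path in plugin source."""
    keys = []
    for m in KEY_RE.finditer(source):
        # Perl string literals escape the backslash as a doubled backslash.
        keys.append(m.group(2).replace("\\\\", "\\"))
    return keys


def scan_plugins(directory):
    """Map plugin file name -> list of key paths for every .pl file."""
    result = {}
    for path in sorted(Path(directory).glob("*.pl")):
        found = keys_in_plugin(path.read_text(encoding="latin-1"))
        if found:
            result[path.name] = found
    return result


if __name__ == "__main__":
    # "plugins" is a hypothetical directory holding the unpacked bundle.
    for name, keys in scan_plugins("plugins").items():
        for key in keys:
            print(f"{name}\t{key}")
```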

3RPG – 4 RegRipper Plugins in 15 minutes

March 15, 2013 in 3RPG, Forensic Analysis, Software Releases

In this post I show how to quickly develop 4 plugins using 3RPG. Except for the documentation (this post) it took barely 10-15 minutes.

You can download plugins here.

01. Detecting presence of 7zip on the system

7-Zip has a key in the following location:

HKEY_LOCAL_MACHINE\SOFTWARE\7-Zip

This is enough to build the script:

01_7zip1

Note that the name of the script is automatically prefixed with an underscore (7zip -> _7zip) for names starting with digits (it’s because perl doesn’t ‘like’ it).

Also, when you paste the 7zip registry key and change the focus, 3RPG will automatically strip the HKEY_LOCAL_MACHINE\SOFTWARE part:

01_7zip2

Now click the code – 3RPG will automatically select it all for your convenience.

01_7zip3

You can now copy this to any editor and save – use a name highlighted in red and with an extension .pl i.e. _7zip.pl.

Then run:

perl rip.pl -r SOFTWARE.copy0 -p _7zip

The result:

01_7zip4

02. Listing persistent network mappings

All mapped drives are listed under the following key:

HKEY_CURRENT_USER\Network

Again, we run through the same exercise as previously – this time we include ‘Yes, scan subkeys, depth=2’:

02_netmap1

Then run:

perl rip.pl -r NTUSER.DAT -p netmap

and the result is:

02_netmap2b

03. Listing all possible CLSID autostart entries

Amongst various lesser-known autostart mechanisms that I listed in my older post we can find adding or re-using entries of COM servers. Such a technique can be used to introduce man-in-the-middle code for legitimate plugins, shell extensions, etc.

The information about the COM servers is stored under the following key:

HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID

The names of DLLs, EXEs, etc. are usually listed under {Default} value, so the plugin below will list (going recursively through the whole node) all possible {Default} values listed under CLSID node.

03_clsid1

We run it as:

perl rip.pl -r Software2 -p clsid

And the results are:

03_clsid2

This is not a perfect solution as many {Default} values don’t include a file name, but we could either grep the results for a specific extension e.g. dll, or patch the script manually and add a better routine (e.g. only list values under InprocServer32 and LocalServer32).

03_clsid3

Last, but not least – running this plugin often probably doesn’t make sense as it’s very slow, but it is a simple example that demonstrates how to search for {Default} values.

 04. Listing keys with binary data

This is just another simple example showing how REG_BINARY data is presented in the output of plugins generated with 3RPG.

For the example, I will look at the key

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\
CurrentVersion\Print\Printers\Microsoft XPS Document Writer

associated with Microsoft XPS Document Writer and its value Default DevMode.

I don’t know what exactly is inside this value, but since it contains a binary blob, it will serve the purpose here.

04_xps1

We run it as:

perl rip.pl -r Software2 -p xps

And the results are:

04_xps2

That’s it! Thanks for reading!

3RPG – Rapid RegRipper Plugin Development

March 14, 2013 in 3RPG, Forensic Analysis, Software Releases

Inspired by DFIR posts from users (often non-programmers) requesting help with writing/improving RegRipper plugins I created a new tool that aims at developing RR plugins in a much faster way.

The tool is called 3RPG and it’s oriented mainly at non-programmers and less experienced programmers. Of course, if you are an old school perl programmer, go ahead and try it as well. Any feedback and comments will be much appreciated.

What is 3RPG?

3RPG is a web form that helps you to quickly build Plugins for RegRipper by Harlan Carvey.

You just need to fill-in a few fields and the code of the new plugin will be ‘developed’ instantly in front of your eyes.

You can go and check how it works here – a screenshot worth 1000 words should help you to get the idea:

3rpg_1

Benefits a.k.a. why 3RPG was created?

If you are a non-programmer…

  • You can use a web form to instantly create your own RegRipper Plugin for a specific registry node/key
  • If you need to add extra features, you can pass such a script with example data to more experienced RegRipper plugin programmers – trust me, they will appreciate the effort you put into research and will be more eager to help
  • You can save 3RPG as an HTML page and use it offline

If you are a programmer…

  • You know that writing new RegRipper plugins ‘by hand’ is kinda painful i.e. it’s easier to modify an existing script to add features than to start from scratch
  • Creating new scripts is usually a copy and paste game – there is always a chance for making a silly typo or mistake
  • In general – in many cases simply (recursively) enumerating a specific registry node/key and cherry-picking something with a simple filter is enough
  • Also, adding a generic data print mechanism for all possible registry data types helps to quickly ‘analyze’ plugins’ output w/o any extra effort
  • ..and this is exactly what the 3RPG offers; more complex scenarios require (obviously) some manual coding
  • You can also fetch the template and adjust it to your needs manually – I am confident that with small modifications it may support all possible registry retrieval needs
  • If you are curious about technical details, I talk about it at the bottom of this post

How to use 3RPG?

Just go to the 3RPG Wizard, fill in the form (takes 1-2 minutes), then copy and paste the resulting script and save to a file – once you do, you are ready to go!

To run/test the script, use the newly created file (here myscript) with RegRipper:

perl rip.pl -r <hive> -p myscript

For a typical script, these fields are required:

  • a script name e.g. myplugin.pl
  • a hive name(s) e.g. Software
  • a node e.g. Microsoft\Windows\CurrentVersion\Run
  • a key name/value (works like a filter) e.g. x86
  • whether you want to scan subkeys (recursively; you can also specify the depth)
  • whether you want to include Wow6432Node keys (typically, you do since many new systems are 64-bit)

and then leave the rest of the fields with their default values.

Share!

If you write a new plugin, share the script with the community (if you do, please fill-in the rest of the fields to avoid generic/default values in the scripts. Thanks!)

 

Examples

Software \ Run key enumeration

Implementing a classic Run key enumeration for the Software hive is easy – it’s actually already written for you on the 3RPG page (it’s based on default values of 3RPG).

Just copy the script from 3RPG page

3rpg_1c

and save it as ‘myscript.pl’, then run it as:

perl rip.pl -r SOFTWARE.copy0 -p myscript

Running it with a test hive gives the following results:

3rpg_2

Software \ Run key enumeration with a specific value

A similar example as before; we just want to narrow down the search by looking for e.g. ‘MSN’.

We just need to type ‘msn’ (it’s case insensitive) in ‘What keys/values would you like to include?‘ field:

3rpg_3

Saving the resulting script and running it as before will only show keys/values/data for values/data that contain ‘msn’ (keys are not checked as you are enumerating recursively anyway).

3rpg_4

Technical details

3RPG is a web form. It’s written in HTML + JavaScript. As a base for the plug-in I relied on my old generic RR plugin template that I used in the past. It exploits the fact that the registry data is stored in a tree-like fashion, so recursive enumeration is a natural way of parsing such data w/o going into intricacies of parsing specific keys, values, and conditional processing. It is also very similar to the way command line reg.exe works when executed with ‘query’ or ‘query /s’.

Currently, the following features are supported:

  • 3RPG is interactive – changes to the script are instantly visible and highlighted in the source code
  • A script name can be specified from the form
  • A hive can be selected manually, but the script will try to select the correct one based on the key i.e. some hive name(s) are automatically selected when key names including substrings like ‘HKEY_LOCAL_MACHINE\Software’ are pasted
  • Enumeration of keys can be recursive, with a specified depth
  • Filtering of key names/values is possible
  • Code for parsing Wow6432Node nodes can be added with a single click
  • Data dumping is supported for all registry data types (non-printable data is printed as hex)
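The recursive enumeration idea is easy to show outside of perl. Below is a minimal Python sketch, not the generated plugin code: the registry is modelled as a hypothetical nested dictionary with `values` and `subkeys` entries, the filter is a case-insensitive substring match on value names and data (key names are not checked), and binary data is printed as hex.

```python
def enumerate_key(key, path="", depth=0, max_depth=None, needle=None):
    """Yield (key_path, value_name, data_as_text) for a registry-like tree."""
    for name, data in key.get("values", {}).items():
        # Binary blobs are shown as hex, everything else as plain text.
        text = data.hex() if isinstance(data, bytes) else str(data)
        if (needle is None
                or needle.lower() in name.lower()
                or needle.lower() in text.lower()):
            yield path, name, text
    if max_depth is not None and depth >= max_depth:
        return  # do not descend below the requested depth
    for sub_name, sub_key in key.get("subkeys", {}).items():
        sub_path = f"{path}\\{sub_name}" if path else sub_name
        yield from enumerate_key(sub_key, sub_path, depth + 1, max_depth, needle)


if __name__ == "__main__":
    # Hypothetical test data standing in for a parsed hive.
    tree = {
        "values": {},
        "subkeys": {
            "Run": {
                "values": {"msnmsgr": "msnmsgr.exe /background",
                           "Blob": b"\x00\x01"},
                "subkeys": {},
            },
        },
    }
    for key_path, name, text in enumerate_key(tree, needle="msn"):
        print(key_path, name, text)
```

With `needle=None` and `max_depth=None` this behaves much like `reg query /s`; setting `max_depth=0` lists only the values of the starting key.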

Bugs

It’s the first version, so bugs are there for sure; if you spot any, please do let me know.

Thanks in advance!

Clustering and Batch Analysis of APT1 sampleset, part 3

March 12, 2013 in Batch Analysis, Malware Analysis

Part 1, Part 2, Part 3

In the last three posts I talked about batch analysis, clustering and applying these techniques to APT sampleset.

Batch processing is a step necessary for retrieving ‘clusterable’ data from samples in an automated fashion.

Clustering is a way of putting these samples into buckets, potentially grouping them into some families.

I want to see if w/o using any assumption/knowledge (retrieved from the white paper or other blogs) it is possible to cluster these samples in a reliable way. It is an interesting experiment and I am curious if I will ever get closer to already known clusters. Quite frankly, I don’t know yet. We shall see.

The clustering I have done so far was focused on dynamic analysis and a little bit on the source code analysis. In this post I will exploit code analysis further – this time focusing on disassembled .asm files generated as usual by IDA Pro.

The resulting assembly code is quite nice for parsing as each line contains only one instruction – this allows us to group the code into blocks on function boundaries and, for each call to an API or to another subroutine (including calls via registers), extract a simplified code of the program procedures e.g.

sub_401000    proc near        ; CODE XREF: _main+20Ap
[...]

lea    ecx, [esp+310h+szLongPath]
push    104h        ; nSize
push    ecx        ; lpFilename
push    0        ; hModule
call    ds:GetModuleFileNameA

lea    edx, [esp+310h+szLongPath]
push    104h        ; cchBuffer
lea    eax, [esp+314h+szLongPath]
push    edx        ; lpszShortPath
push    eax        ; lpszLongPath
call    ds:GetShortPathNameA

lea    ecx, [esp+310h+Parameters]
push    offset String2    ; "/c del "
push    ecx        ; lpString1
call    ds:lstrcpyA

mov    esi, ds:lstrcatA
lea    edx, [esp+310h+szLongPath]
lea    eax, [esp+310h+Parameters]
push    edx        ; lpString2
push    eax        ; lpString1
call    esi ; lstrcatA

lea    ecx, [esp+310h+Parameters]
push    offset s->>>nul    ; " >>NUL"
push    ecx        ; lpString1
call    esi ; lstrcatA

mov    esi, ds:ShellExecuteA
push    0        ; nShowCmd
push    offset Directory ; lpDirectory
lea    edx, [esp+318h+File]
push    offset Parameters ; "/c    del wuauclt.exe"
push    edx        ; lpFile
push    offset Operation ; "open"
push    0        ; hwnd
call    esi ; ShellExecuteA

push    0        ; nShowCmd
push    offset Directory ; lpDirectory
lea    eax, [esp+318h+File]
push    offset s->CDelSvchost_exe ; "/c    del svchost.exe"
push    eax        ; lpFile
push    offset Operation ; "open"
push    0        ; hwnd
call    esi ; ShellExecuteA

[...]
retn
sub_401000    endp

becomes

GetModuleFileNameA
GetShortPathNameA
lstrcpyA
lstrcatA
lstrcatA
ShellExecuteA
ShellExecuteA
ShellExecuteA

and can be written as a single line of code

GetModuleFileNameA|GetShortPathNameA|lstrcpyA|lstrcatA|lstrcatA|ShellExecuteA|ShellExecuteA|ShellExecuteA
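The flattening step above can be sketched roughly like this. It is an illustration, not the script used for this post, and it makes several assumptions about the IDA listing format: procedures are delimited by `proc`/`endp` lines, a call through a register takes its name from the trailing comment when there is one, the `ds:` prefix is dropped, and every `sub_XXXXXX` target is collapsed to `sub`.

```python
import re

PROC_RE = re.compile(r"^\s*(\S+)\s+proc\b")
ENDP_RE = re.compile(r"\bendp\b")
# Operand up to an optional ';' comment, e.g. "call esi ; lstrcatA".
CALL_RE = re.compile(r"^\s*call\s+([^;]+?)\s*(?:;\s*(.*\S))?\s*$")
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp"}


def call_name(operand, comment):
    """Reduce a call operand to the short token used in the flat form."""
    if operand in REGISTERS and comment:
        operand = comment          # call esi ; lstrcatA -> lstrcatA
    if operand.startswith("ds:"):
        operand = operand[3:]      # call ds:ReadFile -> ReadFile
    if operand.startswith("sub_"):
        operand = "sub"            # local subroutines are anonymised
    return operand


def flatten(asm_text):
    """Return one 'A|B|C' string per procedure found in the listing."""
    procs = []
    current = None
    for line in asm_text.splitlines():
        if current is None:
            if PROC_RE.match(line):
                current = []
            continue
        if ENDP_RE.search(line):
            procs.append("|".join(current))
            current = None
            continue
        m = CALL_RE.match(line)
        if m:
            current.append(call_name(m.group(1), m.group(2)))
    return procs
```

Calling `flatten` on the text of an .asm file returns a list with one flattened string per procedure, ready to be counted or compared across samples.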

Applying such methodology on procedure boundaries and to each disassembled program I eventually came up with a shortened and flattened source code of each sample. I then built a histogram of the most common sequences of such code blocks across all the source code from all files and got the following stats:

   5514 |sub
   2507 |sub|sub
   1332 |sub|sub|sub
    860 |sub|sub|sub|sub
    558 |__security_check_cookie(x)
    479 |__security_check_cookie(x)|__security_check_cookie(x)
    475 |sub|sub|sub|sub|sub
    392 |sub|sub|sub|sub|sub|sub
    353 |operator delete(void *)
    276 |sub|operator delete(void *)
    269 |sub|sub|sub|sub|sub|sub|sub
    235 |sub|sub|sub|sub|sub|sub|sub|sub
    185 |sub|sub|sub|sub|sub|sub|sub|sub|sub
    168 |sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
    165 |__alloca_probe|sub|sub
    137 |eax
    132 |sub|sub|ecx
    132 |__alloca_probe|sub
    130 |_atexit
    123 |sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
    110 |_chkstk|sub|sub
    108 |strlen|operator delete(void *)|operator new(uint)|strcpy
    106 |nullsub
    106 |__alloca_probe
    101 |_chkstk|sub
     97 |eax|sub
     92 |__alloca_probe|sub|sub|sub|sub
     91 |__alloca_probe|sub|sub|sub
     88 |_chkstk|sub|sub|sub
     88 |__alloca_probe|sub|sub|sub|sub|sub|sub
     85 |__alloca_probe|sub|sub|sub|sub|sub
     80 |exception const &)
     75 |__alloca_probe|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     73 |strlen
     73 |_chkstk|sub|sub|sub|sub|sub
     72 |sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     71 |__alloca_probe|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     71 |_Tidy(bool,uint)
     69 |sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     68 |sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     68 |_chkstk|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     68 |_chkstk|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     68 |_chkstk|sub|sub|sub|sub|sub|sub|sub|sub
     68 |InternetCloseHandle|InternetCloseHandle|InternetCloseHandle
     67 |sub|eax
     63 |_chkstk|sub|sub|sub|sub|sub|sub
     62 |__alloca_probe|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     62 |__alloca_probe|sub|sub|sub|sub|sub|sub|sub|sub
     61 |free
     60 |sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     60 |allocator<char>>(char const *)|_atexit
     59 |sub|_CxxThrowException(x,x)
     56 |_CxxThrowException
     56 |InternetReadFile
     55 |_chkstk|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     55 |_chkstk
     55 |SetUnhandledExceptionFilter
     52 |operator new(uint)|exception(char const * const &)|_CxxThrowException(x,x)
     52 |operator delete(void *)|_CxxThrowException(x,x)
     52 |_flsall
     51 |_chkstk|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     51 |__alloca_probe|sub|sub|sub|sub|sub|sub|sub|sub|sub
     50 |_chkstk|sub|sub|sub|sub
     49 |j_free
     48 |_chkstk|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     47 |sub|sub|_CxxThrowException(x,x)
     47 |__alloca_probe|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     45 |sub|sub|sub|sub|eax
     44 |strchr|strchr
     44 |malloc|sub|sub|free
     43 |dword ptr [ecx+8]
     42 |__alloca_probe|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     40 |sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     40 |sub|_Split(void)|_wmemmove|sub|_Eos(uint)|_Split(void)|_Tidy(bool)|sub
     40 |operator delete(void *)|operator delete(void *)
     40 |_chkstk|sub|sub|sub|sub|sub|sub|sub|sub|sub
     40 |__alloca_probe|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     40 |ReadFile|_memcpy_0
     39 |sub|_CxxThrowException
     39 |GetModuleFileNameA|GetShortPathNameA|GetEnvironmentVariableA|lstrcpyA|lstrcatA|lstrcatA|GetCurrentProcess|SetPriorityClass|GetCurrentThread|SetThreadPriority|ShellExecuteExA|SetPriorityClass|SetProcessPriorityBoost|SHChangeNotify|GetCurrentProcess|SetPriorityClass|GetCurrentThread|SetThreadPriority
     38 |sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     38 |_chkstk|sub|sub|sub|sub|sub|sub|sub
     37 |GetCurrentProcess|OpenProcessToken|LookupPrivilegeValueA|AdjustTokenPrivileges|CloseHandle|GetLastError
     36 |sub|sub|dword ptr [eax]|sub|sub|sub
     36 |sub|ecx
     36 |dword ptr [ecx+4]
     36 |_memset|sub|__security_check_cookie(x)
     35 |sub|sub|__security_check_cookie
     35 |sub|operator delete(void *)|operator delete(void *)|operator delete(void *)|operator delete(void *)
     35 |__invalid_parameter_noinfo
     34 |operator new(uint)
     34 |_free
     34 |_LocaleUpdate(localeinfo_struct *)|___strgtold12_l|sub|__security_check_cookie(x)
     33 |sub|sub|eax|sub
     33 |sub|operator delete(void *)|operator delete(void *)
     33 |_chkstk|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     33 |__errno|__invalid_parameter
     32 |operator delete(void *)|operator new(uint)
     32 |memset
     31 |operator new(uint)|sub
     31 |_chkstk|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     30 |eax|sub|sub|sub|sub
     30 |_chkstk|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     30 |__EH_prolog|_Tidy(bool)|_strlen|sub|sub|_CxxThrowException(x,x)
     30 |SetServiceStatus
     28 |_chkstk|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     27 |sub|_Split(void)|_memcpy|sub|_Eos(uint)|_Split(void)|_Tidy(bool)|sub
     27 |strlen|sub
     27 |memcpy
     27 |_strcmpi|memset|memset|CreateToolhelp32Snapshot|Process32First|sprintf|strcat|Process32Next|CloseHandle|_strcmpi|OpenSCManagerA|EnumServicesStatusExA|operator new(uint)|CloseServiceHandle|strcat|EnumServicesStatusExA|sprintf|strcat|operator delete(void *)|CloseServiceHandle|_strcmpi|GetLogicalDrives|sprintf|strcat|sprintf|strcat|lstrcatA|GetDriveTypeA|strcat|GetVolumeInformationA|strcat|strcat|sprintf|strcat
     27 |_strcmpi|atoi|OpenProcess|TerminateProcess|CloseHandle|strcat|_strcmpi|OpenSCManagerA|OpenServiceA|GetLastError|strcat|CloseServiceHandle|ControlService|GetLastError|strcat|CloseServiceHandle|CloseServiceHandle
     27 |__alloca_probe|sub|sub|sub|sub|sub|sub|sub
     27 |GetProcAddress
     27 |GetExitCodeProcess|PeekNamedPipe|Sleep|ReadFile|CloseHandle|CloseHandle|memset|strcpy|strlen
     26 |sub|sub|sub|sub|_memcpy_s
     26 |sub|eax|sub|eax|sub
     26 |sub|_Tidy(bool)|_Tidy(bool)|sub
     26 |strstr|strchr|operator new(uint)|strchr|strchr|strchr|strchr|strchr|strchr|strchr|strchr|strchr|operator delete(void *)
     26 |strlen|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     26 |sprintf|HttpAddRequestHeadersA|HttpSendRequestA|GetLastError|InternetQueryOptionA|InternetSetOptionA|sprintf
     26 |__ld12cvt
     26 |___strgtold12|sub
     26 |__EH_prolog3|sub|sub|_CxxThrowException(x,x)
     26 |InternetOpenA|InternetSetOptionA|InternetSetOptionA|InternetSetOptionA|InternetConnectA|HttpOpenRequestA|strlen|HttpAddRequestHeadersA
     26 |$+5
     25 |rand
     25 |malloc|CreatePipe|CreatePipe|CloseHandle|CloseHandle|CloseHandle|CloseHandle|free|sub|CloseHandle|CloseHandle
     25 |_chkstk|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     25 |__invalid_parameter_noinfo|__invalid_parameter_noinfo
     25 |__alloca_probe|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     25 |URLDownloadToFileA|strcat
     24 |sub|sub|sub|sub|sub|GetProcAddress|sub|sub|sub
     24 |sub|edx|sub
     24 |sub|_Split(void)|_wmemmove|sub|_Eos(uint)|_Split(void)|sub|sub
     24 |shutdown|closesocket
     24 |send
     24 |fopen|fseek|fread|fseek|ftell|fseek|fread|fclose|fclose|fread|fclose|sub
     24 |edx
     24 |dword ptr [eax+40h]
     24 |_beginthreadex|CloseHandle
     24 |__alloca_probe|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     24 |GetModuleHandleA|GetProcAddress
     23 |unknown_libname_1
     23 |sub|sub|sub|sub|operator delete(void *)
     23 |sub|OpenProcess|TerminateProcess|Sleep|CloseHandle|sub
     23 |strlen|CreateFileA|strlen|operator new(uint)|memset|WriteConsoleInputA|operator delete(void *)|CloseHandle
     23 |strcat|sub|WaitForSingleObject|strcat|strcat|strlen|sub
     23 |j_free|j_free
     23 |j_free|_CxxThrowException
     23 |LoadStringA|sub
     23 |CloseHandle
     22 |~type_info(void)|operator delete(void *)
     22 |sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     22 |sub|operator new(unsigned __int64)|exception(char const * const &)|_CxxThrowException|sub|sub|j_free
     22 |operator new(uint)|operator new(uint)|sub
     22 |operator new(uint)|operator delete(void *)
     22 |operator delete(void *)|operator delete(void *)|operator delete(void *)
     22 |exception(char const * const &)
     22 |eax|sub|sub|sub
     22 |GetCurrentProcess|GetCurrentProcess|DuplicateHandle|CreateProcessA|CloseHandle
     22 |CompareStringA
     22 |$+5|sub|sub
     21 |sub|_wcslen|sub|sub|sub|sub
     21 |sprintf|sprintf|sub
     21 |malloc|recv|sub|sub|_strnicmp|WriteFile|recv|free|ExitThread|SetEvent|free|ExitThread
     21 |malloc|PeekNamedPipe|ReadFile|sub|sub|_itoa|send|sub|Sleep|PeekNamedPipe|free|ExitThread
     21 |_strcmpi|memset|CreateProcessA|strcat|CloseHandle|_strcmpi|OpenSCManagerA|strcat|OpenServiceA|GetLastError|strcat|CloseServiceHandle|StartServiceA|GetLastError|strcat|CloseServiceHandle|CloseHandle
     21 |__get_sse2_info
     21 |__alloca_probe|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     21 |__alloca_probe|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     21 |__alloca_probe|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     21 |__alloca_probe|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub|sub
     21 |GetCurrentProcess|OpenProcess|GetLastError|sprintf|strcat|OpenProcessToken|memset|sprintf|CreateProcessAsUserA|strcat|CloseHandle|CloseHandle|GetLastError|sprintf|strcat|CloseHandle|GetLastError|sprintf|strcat|CloseHandle
     21 |CreateEventA|CreateEventA|sub|WaitForSingleObject|CloseHandle
     21 |$+5|sub
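
A histogram like the one above can be sketched in a few lines of Python. This is only a rough illustration, not the script used for this post: the input format (one list of call targets per procedure), the `flatten` and `histogram` helper names, and the rule that collapses IDA-style `sub_XXXXXX` names into `sub` are all assumptions.

```python
from collections import Counter


def flatten(calls):
    """Join the call targets of one procedure into a single '|'-prefixed key.

    Assumption: auto-generated names such as sub_401000 are collapsed
    to plain 'sub' so that addresses do not make every key unique.
    """
    normalized = ["sub" if c.startswith("sub_") else c for c in calls]
    return "|" + "|".join(normalized)


def histogram(procedures):
    """Count how often each flattened procedure occurs; empty ones are skipped."""
    return Counter(flatten(calls) for calls in procedures if calls)


# Hypothetical input: one list of call targets per procedure, all samples pooled.
procedures = [
    ["sub_401000"],
    ["sub_401000", "sub_402000"],
    ["sub_403000"],
    ["strlen"],
    ["GetModuleHandleA", "GetProcAddress"],
]

# Print in the same "count |sequence" layout as the stats above.
for key, count in histogram(procedures).most_common():
    print(f"{count:7d} {key}")
```

Keeping the whole flattened procedure as the key (rather than individual calls) is what makes the long `sub|sub|sub…` runs show up as separate entries.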

Using these shortened procedures to generate clusters gives some promising results, e.g.:

sub
DeleteFileW
DeleteFileA

1328eaceb140a3863951d18661b097af.asm
31e5e58dbdfad05175613e795298ebb5.asm
6f9992c486195edcf0bf2f6ee6c3ec74.asm
c99fa835350aa9e2427ce69323b061a9.asm
e476e4a24f8b4ff4c8a0b260aa35fc9f.asm
ea1b44094ae4d8e2b63a1771a3e61fd5.asm
fc1937c1aa536b3744ebdfb1716fd54d.asm

LoadLibraryA
GetProcAddress
GetProcAddress
GetProcAddress

3f8682ab074a097ebbaadbf26dfff560.asm
4b19a2a6d40a5825e868c6ef25ae445e.asm
54d5d171a482278cc8eacf08d9175fd7.asm
56de2854ef64d869b5df7af5e4effe3e.asm
75dad1ccabae8adeb5bae899d0c630f8.asm
8462a62f13f92c34e4b89a7d13a185ad.asm

htons
socket
connect
closesocket

468ff2c12cffc7e5b2fe0ee6bb3b239e.asm
727a6800991eead454e53e8af164a99c.asm
bd8b082b7711bc980252f988bb0ca936.asm
db05df0498b59b42a8e493cf3c10c578.asm
e1b6940985a23e5639450f8391820655.asm

ecx
eax
dword ptr [esi+10h]
sub
ecx
eax
sub
sub
sub
sub
sub
sub
sub
sub

12f25ce81596aeb19e75cc7ef08f3a38.asm
268eef019bf65b2987e945afaf29643f.asm
468ff2c12cffc7e5b2fe0ee6bb3b239e.asm
4c6bddcca2695d6202df38708e14fc7e.asm
5a728cb9ce56763dccb32b5298d0f050.asm
727a6800991eead454e53e8af164a99c.asm
8e8622c393d7e832d39e620ead5d3b49.asm
bd8b082b7711bc980252f988bb0ca936.asm
c6a4bb1a4e4f69ec71855d70d6960859.asm
db05df0498b59b42a8e493cf3c10c578.asm
e1b6940985a23e5639450f8391820655.asm
ef8e0fb20e7228c7492ccdc59d87c690.asm

LoadLibraryA
GetProcAddress
sub
sub
strstr
strchr
GetSystemDirectoryA
time
srand
malloc
sub
sub
strncmp
Sleep
sub
Sleep
sub
Sleep
CreatePipe
CreatePipe
GetStartupInfoA
CreateProcessA
GetLastError
_snprintf
sub
CreateProcessA
CreateThread
CreateThread
WaitForMultipleObjects
GetExitCodeThread
TerminateThread
GetExitCodeThread
TerminateThread
GetExitCodeProcess
TerminateProcess
sub
sub
GetLastError
_snprintf
sub
CloseHandle
CloseHandle
CloseHandle
CloseHandle
sub
sub
Sleep
PeekNamedPipe
ReadFile
sub

0dd3677594632ce270bcf8af94819caf.asm
270d42f292105951ee81e4085ea45054.asm
270d42f292105951ee81e4085ea45054.asm
523f56515221161579ee6090c962e5b1.asm
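
Clusters like the ones above boil down to grouping files by a shared flattened procedure. Here is a minimal sketch of that idea, assuming each sample has already been reduced to a list of flattened procedure strings; the `cluster_by_procedure` helper, the file names and the input layout are hypothetical.

```python
from collections import defaultdict


def cluster_by_procedure(samples):
    """Map each flattened procedure to the sorted list of files containing it.

    Only procedures shared by at least two files are kept, since a
    procedure seen in a single file does not form a cluster.
    """
    clusters = defaultdict(set)
    for filename, procedures in samples.items():
        for proc in procedures:
            clusters[proc].add(filename)
    return {p: sorted(files) for p, files in clusters.items() if len(files) > 1}


# Hypothetical samples: file name -> flattened procedures found in it.
samples = {
    "a.asm": ["|htons|socket|connect|closesocket", "|sub"],
    "b.asm": ["|htons|socket|connect|closesocket"],
    "c.asm": ["|sub|DeleteFileW|DeleteFileA"],
}

clusters = cluster_by_procedure(samples)
for proc, files in clusters.items():
    # One call per line, a blank line, then the files sharing the procedure.
    print(proc.strip("|").replace("|", "\n"))
    print()
    print("\n".join(files))
    print()
```

In practice very generic keys such as `|sub` would need to be filtered out first, otherwise nearly every file ends up in the same cluster.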

Notably, the disassembled code, after some selective processing and normalization, can be treated in the same way as the source code that students submit for their assessments at uni and… be checked for plagiarism. A common technique used for this purpose relies on measuring cosine similarity. I am currently playing with it and will write more about my findings in another post.
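
For reference, cosine similarity between two samples can be sketched as below. This is only an illustration under stated assumptions: each sample is treated as a bag of call names (the order of calls is ignored), and the `cosine_similarity` helper and the example call lists are made up for this sketch, not taken from the experiments described in this post.

```python
import math
from collections import Counter


def cosine_similarity(calls_a, calls_b):
    """Cosine of the angle between two call-count vectors (0.0 to 1.0)."""
    va, vb = Counter(calls_a), Counter(calls_b)
    # Dot product over the call names present in both samples.
    dot = sum(va[name] * vb[name] for name in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sample is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical flattened samples.
sample_a = ["htons", "socket", "connect", "closesocket"]
sample_b = ["htons", "socket", "connect", "send", "closesocket"]
sample_c = ["DeleteFileW", "DeleteFileA"]

print(cosine_similarity(sample_a, sample_b))
print(cosine_similarity(sample_a, sample_c))
```

A score close to 1 means the two samples use nearly the same mix of calls, while a score of 0 means they share none.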

Thanks for reading!