Clustering and Batch Analysis

I have recently been experimenting with clustering various sets of malicious samples: running files through a sandbox and static analysis tools, then normalizing the output and building histograms from it. The results are not mind-blowing, but they are encouraging. They help group malware families into separate buckets, improve log parsing routines, and in some cases can be leveraged to quickly discover hidden properties of the malware, e.g. encryption keys, User Agents, HTTP verbs, etc. These may then be used for more in-depth analysis of proxy logs and similar data.
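To make the "normalize, then histogram" idea a bit more concrete, here is a minimal Python sketch. The log line format, the placeholder names and the regex rules are all assumptions made up for illustration; real sandbox output will need its own rules.

```python
import re
from collections import Counter

# Hypothetical normalization rules: volatile values are replaced with
# placeholders so that lines differing only in those values collapse together.
# Order matters: the more specific patterns run first.
_RULES = [
    (re.compile(r"\b[0-9a-f]{32,64}\b"), "<HASH>"),      # MD5/SHA1/SHA256-like hex
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),  # dotted IPv4
    (re.compile(r"\b0x[0-9a-f]+\b"), "<HEX>"),           # hex constants, handles
    (re.compile(r"\b\d+\b"), "<NUM>"),                   # standalone numbers
]

def normalize(line: str) -> str:
    """Lowercase a log line and replace volatile values with placeholders."""
    line = line.strip().lower()
    for pattern, placeholder in _RULES:
        line = pattern.sub(placeholder, line)
    return line

def histogram(lines):
    """Count how often each normalized line occurs, skipping blank lines."""
    return Counter(normalize(line) for line in lines if line.strip())

if __name__ == "__main__":
    sample_log = [
        "Connect 10.0.0.5 port 8080",
        "connect 192.168.1.7 port 443",
        "Mutex 0x1F4 created",
    ]
    for entry, count in histogram(sample_log).most_common():
        print(count, entry)
```

The first two lines collapse into a single bucket once the IP and port are replaced, which is the whole point: the histogram shows behavior patterns rather than one-off values.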

Here is a short list of ‘clusterable’ attributes, in case you want to design your own clustering solution and are looking for a quick cheat sheet; it is certainly far from complete, but may give you some pointers:

STATIC

File Name
File Extension
File Size
File Type
- This will have a lot of ‘subtypes’ – for MZ files see details here and here
- For executables – sequence of bytes at the entry point, and at the real entry point (main, wmain, DllMain, as well as VB and Delphi code)
- For PE file – for each of these: their names where applicable, sizes, flags, entropy, strings:
  - sections (for list of known sections see here)
  - import tables
  - export tables
- For PE file –
  - PE type
  - Image base
  - Compilation/debug time stamps
  - Resources – number, topology
  - Debug strings
File Entropy
Compiler (PEiD, etc.)
Packer, protector
File hashes (MD5, SHA1, CTPH, …)
Extracted strings
Presence and characteristics of appended data (e.g. installers)
Sequences of code
- Disassembled code
- Decompiled code
- Selected code (e.g. map of calls)
Detection by various AVs
Multimedia properties (e.g. width, height, EXIF data, etc.)
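A handful of the static attributes above can be collected with nothing but the Python standard library. The sketch below is only an illustration: the function names and the dictionary keys are my own, and it covers just file name, extension, size, a crude MZ check, entropy and two hashes. PE-specific attributes, CTPH and the rest would need dedicated parsers.

```python
import hashlib
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty input, at most 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # byte value -> number of occurrences
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def static_features(path: str) -> dict:
    """Collect a few simple static attributes of one file (keys are hypothetical)."""
    with open(path, "rb") as f:
        data = f.read()
    name = os.path.basename(path)
    return {
        "file_name": name,
        "file_extension": os.path.splitext(name)[1].lower(),
        "file_size": len(data),
        "is_mz": data[:2] == b"MZ",  # crude file-type check only
        "entropy": shannon_entropy(data),
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }
```

Each resulting dictionary is one row of features per sample; bucketing the numeric values (e.g. rounding entropy to one decimal place) usually makes them easier to histogram.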

DYNAMIC

Accessed IPs
Accessed URLs
GET and POST Queries
User Agents
Ports used
Created/accessed Mutexes/mutants
Created/accessed Atoms
Created/accessed Window names
Created/accessed Window classes
Created/accessed Windows topology
Windows’ visibility
Windows’ Unicodeness
Windows’ topology
Windows’ titles
Windows’ classes
Crypto used, and whether it is built-in or API-based
Popular strings used (e.g. copyright banners as seen here)
Execution paths (code, sequences, code blocks, API sequences)
Use of location-independent code
Use of escalation of privileges tricks
Use and type of code injection
Use of kernel drivers (including system DLLs)
Use of stolen certificates
Use of anti-* techniques
Use of 0days
Use of timestomping
Use of dynamically built strings (run-time)
Use of code to adjust privileges
Use of keylogging techniques (and what type: hook, API hook, etc.)
Use of external tools (e.g. cmd.exe, reg.exe, net.exe)
Use of autorun.inf
Use of DKOM
Use of code directly accessing physical drives
Use of code directly accessing physical memory
Use of code directly accessing BIOS
Use of hypervisor
MBR – code modification
MBR – partition table modification
Passwords used for encryption and to access (e.g. FTP/SMTP/IRC)
Dropped file locations, names
Searched path locations, registry names
Targeted applications (e.g. browser, mail, IM and P2P clients, etc.)
Added/modified registry entries
APIs executed and their arguments
- Type of APIs (kernel32 Win32 APIs or ntdll Zw*/Nt* APIs)
- Delays used in waiting functions
- APIs/techniques used for memory allocation (heap, virtual*, stack-based, etc.)
- APIs/techniques used for self-deletion
- APIs/techniques used for running other .exes
- APIs/techniques used for networking (winsock or wininet; also Rtl functions from ntdll)
- APIs/techniques used for network enumeration (Net*, WNet*, Domain*)
- Process enumeration APIs
- …

Let me interrupt you here…

Okay, okay, I get it!!! It is a never-ending list!!!
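Never-ending or not, once some of these attributes are collected per sample, the bucketing itself can be very simple. The sketch below is one possible approach and not a description of any particular tool: each sample is a set of attribute strings, similarity is the Jaccard index, and grouping is a greedy single pass against the first member of each bucket. The `kind:value` attribute format, the sample names and the 0.5 threshold are all assumptions for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # two empty attribute sets are treated as identical
    return len(a & b) / len(a | b)

def group_samples(samples: dict, threshold: float = 0.5) -> list:
    """Greedy grouping: a sample joins the first bucket whose first member
    (used as the bucket's representative) is at least `threshold` similar;
    otherwise it starts a new bucket."""
    buckets = []  # each bucket is a list of sample names
    for name, attrs in samples.items():
        for bucket in buckets:
            if jaccard(attrs, samples[bucket[0]]) >= threshold:
                bucket.append(name)
                break
        else:
            buckets.append([name])
    return buckets

if __name__ == "__main__":
    # Hypothetical samples and attributes
    samples = {
        "a.exe": {"mutex:foo", "ua:Mozilla/4.0", "port:80"},
        "b.exe": {"mutex:foo", "ua:Mozilla/4.0", "port:8080"},
        "c.exe": {"atom:bar", "port:6667"},
    }
    print(group_samples(samples))
```

The greedy pass is order-dependent and crude, but it is cheap and easy to inspect; a proper hierarchical or density-based clustering can replace it once the feature extraction is stable.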

Hexacorn
