I have recently been toying with clustering of various malicious sample sets: running files through a sandbox and static analysis tools, and then applying various normalization steps and histograms to the output. The results are not mind-blowing, but they are encouraging. They help in grouping malware families into separate buckets, improve log parsing routines, and in some cases can also be leveraged to quickly discover hidden properties of the malware (e.g. encryption keys, User Agents, HTTP verbs), which may then be used for more in-depth analysis of proxy logs and the like.
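To make the 'normalization and histograms' bit a little more concrete, here is a minimal sketch in Python. It is not the tooling I actually used; the rules in `RULES` and the `<IP>`, `<HEX>`, `<N>` placeholder tokens are assumptions picked for illustration, and you would tune them to whatever your sandbox logs look like.

```python
import re
from collections import Counter

# Hypothetical normalization rules (assumptions, not a fixed standard).
# Order matters: more specific patterns go first.
RULES = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),  # dotted IPv4-looking values
    (re.compile(r"\b[0-9a-f]{32,64}\b"), "<HEX>"),         # hash-like hex blobs
    (re.compile(r"\d+"), "<N>"),                           # any remaining digit runs
]

def normalize(line):
    """Lowercase a log line and replace volatile values with placeholder tokens."""
    line = line.strip().lower()
    for pattern, token in RULES:
        line = pattern.sub(token, line)
    return line

def histogram(lines):
    """Count how often each normalized line occurs, skipping blank lines."""
    return Counter(normalize(l) for l in lines if l.strip())

if __name__ == "__main__":
    # Made-up log lines, only for demonstration.
    log = [
        "Connect 10.0.0.1:80",
        "connect 192.168.1.5:8080",
        "CreateMutex Global\\abc123",
    ]
    for entry, count in histogram(log).most_common():
        print(count, entry)
```

Lines that differ only in volatile values collapse into one histogram entry, which is what makes the counts comparable across samples.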
Here is a short list of ‘clusterable’ attributes just in case you want to design your own clustering solution and are looking for a quick cheat list; it is certainly far from being complete, but may give you some pointers:
STATIC
- File Name
- File Extension
- File Size
- File Type
- This will have a lot of ‘subtypes’ – for MZ files see details here and here
- For executable – sequence of bytes at the entry point, and at the real entry point (for main, wmain, DLLMain, as well as for VB, Delphi code)
- For PE file – for each of these: their names where applicable, sizes, flags, entropy, strings:
- sections (for list of known sections see here)
- import tables
- export tables
- For PE file –
- PE type
- Image base
- Compilation/debug time stamps
- Resources – number, topology
- Debug strings
- File Entropy
- Compiler (PEiD, etc.)
- Packer, protector
- File hashes (MD5, SHA1, CTPH, …)
- Extracted strings
- Presence and characteristics of appended data (e.g. installers)
- Sequences of code
- Disassembled code
- Decompiled code
- Selected code (e.g. map of calls)
- Detection by various AVs
- Multimedia properties (e.g. width, height, EXIF data, etc.)
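A few of the static attributes above (file size, a crude file type check, file entropy, file hashes) can be pulled with nothing but the Python standard library. The sketch below is only an illustration under that assumption; the `static_features` helper and its dictionary keys are made up for this example, and anything PE-specific (sections, imports, resources) would need a proper parser.

```python
import hashlib
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty input, 8.0 maximum)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def static_features(data: bytes) -> dict:
    """Collect a handful of cheap static attributes from raw file bytes."""
    return {
        "size": len(data),
        "is_mz": data[:2] == b"MZ",  # crude check for the MZ magic only
        "entropy": round(shannon_entropy(data), 2),
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }

if __name__ == "__main__":
    # Made-up bytes standing in for a real sample read from disk.
    print(static_features(b"MZ\x90\x00" + bytes(60)))
```

Bucketing the rounded entropy or the file size into ranges before clustering tends to work better than using the raw values.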
DYNAMIC
- Accessed IPs
- Accessed URLs
- GET and POST Queries
- User Agents
- Ports used
- Created/accessed Mutexes/mutants
- Created/accessed Atoms
- Created/accessed Window names
- Created/accessed Window classes
- Created/accessed Windows topology
- Windows’ visibility
- Windows’ Unicodeness
- Windows’ topology
- Windows’ titles
- Windows’ classes
- Crypto used + built-in or API-based
- Popular strings used (e.g. copyright banners as seen here)
- Execution paths (code, sequences, code blocks, API sequences)
- Use of location-independent code
- Use of escalation of privileges tricks
- Use and type of code injection
- Use of kernel drivers (including system DLLs)
- Use of stolen certificates
- Use of anti-* techniques
- Use of 0days
- Use of timestomping
- Use of dynamically built strings (run-time)
- Use of code to adjust privileges
- Use of keylogging techniques (and what type: hook, API hook, etc.)
- Use of external tools (e.g. cmd.exe, reg.exe, net.exe)
- Use of autorun.inf
- Use of DKOM
- Use of code directly accessing physical drives
- Use of code directly accessing physical memory
- Use of code directly accessing BIOS
- Use of hypervisor
- MBR – code modification
- MBR – partition table modification
- Passwords used for encryption and to access (e.g. FTP/SMTP/IRC)
- Dropped file locations, names
- Searched path locations, registry names
- Targeted applications (e.g. browser, mail, IM and P2P clients, etc.)
- Added/modified registry entries
- APIs executed and their arguments
- Type of APIs (kernel32 win32 APIs or ntdll Zw/Nt APIs)
- Delays used in waiting functions
- APIs/techniques used for memory allocation (heap, virtual*, stack-based, etc.)
- APIs/techniques used for self-deletion
- APIs/techniques used for running other .exes
- APIs/techniques used for network (winsock or wininet; also Rtl functions from ntdll)
- APIs/techniques used for network enumeration (Net*, WNet*, Domain*)
- Process enumeration APIs
- …
Let me interrupt you here…
Okay, okay, I get it!!! It is a never ending list!!!
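Never ending or not, once a few of the dynamic attributes are extracted, each sample can be treated as a set of tagged values and compared with a simple set similarity. Below is a minimal sketch assuming Jaccard similarity and a greedy, single-pass bucketing; the `group_samples` helper, the `mutex:`/`ua:`/`api:` tag prefixes, the 0.5 threshold, and the sample data are all hypothetical, not a description of any particular clustering product.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection divided by size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def group_samples(samples: dict, threshold: float = 0.5) -> list:
    """Greedy single-pass grouping.

    Each sample joins the first bucket whose first member (used as the
    bucket's representative) is at least `threshold` similar; otherwise
    it starts a new bucket.
    """
    buckets = []  # list of lists of sample names
    for name, attrs in samples.items():
        for bucket in buckets:
            if jaccard(attrs, samples[bucket[0]]) >= threshold:
                bucket.append(name)
                break
        else:
            buckets.append([name])
    return buckets

if __name__ == "__main__":
    # Made-up attribute sets, as if extracted from sandbox reports.
    samples = {
        "a.exe": {"mutex:Global\\xyz", "ua:Mozilla/4.0", "api:CreateRemoteThread"},
        "b.exe": {"mutex:Global\\xyz", "ua:Mozilla/4.0", "api:WriteProcessMemory"},
        "c.exe": {"reg:Run\\updater", "tool:cmd.exe"},
    }
    print(group_samples(samples))
```

The greedy approach depends on input order and on the chosen representative, so for anything serious a proper hierarchical or density-based clustering is the better choice; this is just the quickest way to see whether the attributes separate the families at all.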