Finding the Smoking Gun and going beyond that – Helpful Forensic Artifacts

While I am quite critical of the idea of collecting IOCs (Indicators of Compromise) describing various malware, traces of hacking, etc. in the form of hashes (even fuzzy hashes), file names, sizes, and so on, I do believe that there is a certain number of IOCs (or, as I call them, HFAs – Helpful Forensic Artifacts – as they are not necessarily relevant to the compromise itself) that are universal and worth collecting. I am talking about artifacts that are common to malware functionality and offensive activities on the system in general, as well as any other artifact that may help both attackers and… investigators (thanks to ‘helpful’ users that leave unencrypted credentials in text files, watch movies on critical systems, etc.).

In this post, I will provide some practical examples of what I mean by that.

Before I kick it off, just a quick reminder – the reason why I am critical of bloated IOC databases is that they have very limited applicability in a general sense; the limitations come as a result of various techniques used by malware authors, offensive teams, etc., including, but not limited to:

  • metamorphism
  • randomization
  • encryption
  • data (e.g. strings) built on the fly (instead of hardcoded)
  • shellcode-like payloads
  • fast-flux
  • P2P
  • covert channels
  • etc.

Notably, antivirus detection of very advanced, metamorphic malware relies on state machines, not strings, and it’s naive to assume that collecting file names like sdra64.exe is going to save the day…

Anyway…

If we talk about good, interesting HFAs, I think of things such as:

  • artifacts that are very often used by malware because of the simple fact that they need to be there (dropped files, autostart entries, etc.)
  • traces of activities that must be carried out on the compromised system (recon, downloading toolchests, etc.)
  • also (notably) traces of user activity that support the attacker’s work (e.g. a file named password.txt is not an IOC, but it is an HFA)
  • traces of the system being affected in a negative way, e.g. if the system has been compromised previously by some generic malware, certain settings could have been changed (disabled tracing, blocked Task Manager, etc.); these are IOCs in a generic sense, but not really relevant to the actually investigated compromise; you can think of them as the aspects of system security that place the system on the opposite side of a properly secured and hardened box; this also includes previously detected/removed malware – imho AV logs are not ‘clear’ IOCs as long as they relate to an event unrelated to the investigated compromise

If we talk about typical random malware, it’s usually stupidly written, using snippets copied and pasted from many sources on the internet. The authors are lazy and don’t even bother to encrypt strings, so detection is really easy. You can grep the file or a memory dump of a suspected process for typical autorun strings with strings, BinText, or HexDive, and most of the time you will find the smoking gun. If the attacker is advanced, all you will deal with is a blob of binary data that has no visible trace of being malicious unless disassembled – that is, a relocation-independent, shellcode-like piece of mixed code/data in a metamorphic form that doesn’t require all the fuss of inline DLL/EXE loading, but is just a pure piece of code. It’s actually simple to write with a basic knowledge of assembly language and OS internals. I honestly don’t know how to detect such malware in a generic way. I do believe that’s where the future of advanced malware is, though (apart from going mobile). And I chuckle when I see malware that is 20MB in size (no matter how advanced the functionality).
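
To make the grep idea concrete, below is a minimal sketch in Python (a stand-in for strings/BinText-style searching, not a tool from this post) that scans a file or a raw process dump for a few typical autorun-related strings; the string list is illustrative, not exhaustive:

    import sys

    # Illustrative, not exhaustive, autorun-related strings
    AUTORUN_HINTS = [
        b"CurrentVersion\\Run",
        b"CurrentVersion\\RunOnce",
        b"Userinit",
        b"Start Menu\\Programs\\Startup",
    ]

    def scan(path, context=32):
        data = open(path, "rb").read()
        for needle in AUTORUN_HINTS:
            pos = data.find(needle)
            while pos != -1:
                lo = max(0, pos - context)
                # print the hit offset plus surrounding bytes for context
                print(f"{pos:#010x} {data[lo:pos + len(needle) + context]!r}")
                pos = data.find(needle, pos + 1)

    if __name__ == "__main__":
        scan(sys.argv[1])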

When we talk about IOCs/HFAs and offensive security practices, it is worth mentioning that we need to follow the thought process of an attacker. Let me give you an example. Assume that the attacker gets on the system. What can s/he do? If malware is already there, it’s easy, as the functionality is out there and can be leveraged: the malicious payload can be updated and the attacker can do anything that the actual payload is programmed to do, within the boundaries of what the environment where it runs permits. On the other hand, if it is an attack that comes through a typical hacking attempt, the situation is different. In fact, the attacker is very limited when it comes to available tools/functionality and often has to leverage existing OS tools. This means exactly what it says – the attacker operates in a minimalistic environment and is going to use any possible tool available on the OS to his/her benefit. If we talk about a Windows system, these can be

  • net.exe (and also net1.exe)
  • telnet.exe
  • ftp.exe

but also

  • arp.exe
  • at.exe
  • attrib.exe
  • bitsadmin.exe
  • cacls.exe
  • certutil.exe
  • cmd.exe
  • command.com
  • compact.exe
  • cscript.exe
  • debug.exe
  • diantz.exe
  • findstr.exe
  • hostname.exe
  • icacls.exe
  • iexpress.exe
  • ipconfig.exe
  • makecab.exe
  • mofcomp.exe
  • more.com
  • msiexec.exe
  • mstsc.exe
  • net1.exe
  • netsh.exe
  • netstat.exe
  • ping.exe
  • powershell.exe
  • reg.exe
  • regedit.exe
  • regedt32.exe
  • regini.exe
  • regsvr32.exe
  • robocopy.exe
  • route.exe
  • runas.exe
  • rundll32.exe
  • sc.exe
  • schtasks.exe
  • scrcons.exe
  • shutdown.exe
  • takeown.exe
  • taskkill.exe
  • tasklist.exe
  • tracert.exe
  • vssadmin.exe
  • whoami.exe
  • wscript.exe
  • xcacls.exe
  • xcopy.exe

and OS commands

  • echo
  • type
  • dir
  • md/mkdir
  • systeminfo

and many other command line tools and commands.

So, if you analyze a memory dump from a Windows system, it’s good to search for the presence of file names associated with built-in Windows utilities and look at the context, i.e. the surrounding memory region, to see what could possibly be the reason for them being there (cmd.exe /c being the most common, I guess).
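
A minimal sketch of such a search in Python (the tool list is trimmed for illustration, and both ASCII and UTF-16LE encodings are checked, since Windows keeps most of its strings in the latter):

    import sys

    # A small subset of the built-in utilities listed above; extend as needed
    TOOLS = ["cmd.exe /c", "net.exe", "ftp.exe", "bitsadmin.exe", "schtasks.exe"]

    def scan(path, context=48):
        data = open(path, "rb").read()
        for name in TOOLS:
            for needle in (name.encode("ascii"), name.encode("utf-16-le")):
                pos = data.find(needle)
                while pos != -1:
                    lo = max(0, pos - context)
                    hi = pos + len(needle) + context
                    print(f"{name:14} @ {pos:#010x}: {data[lo:hi]!r}")
                    pos = data.find(needle, pos + 1)

    if __name__ == "__main__":
        scan(sys.argv[1])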

Back to the original reason for this post – since I wanted to provide some real/practical examples of HFAs that one can utilize to analyze hosts, let me start with a simple classification by functionality/purpose:

  • information gathering
    • net.exe
    • net1.exe
    • psexec.exe/psexesvc.exe
    • dsquery.exe
    • arp.exe
    • traces of shell being used (cmd.exe /c)
    • passwords.txt, password.txt, pass.txt, etc.
  • data collection
    • type of files storing collected data
      • possibly password protected archives
      • encrypted data (e.g. credit cards/track data)
    • various 3rd party tools to archive data:
      • rar, 7z, pkzip, tar, arj, lha, kgb, xz, etc.
    • OS-based tools
      • compress.exe
      • makecab.exe
      • iexpress.exe
      • diantz.exe
    • type of collected data
      • screen captures often saved as .jpg (small size)
      • screen captures file names often include date
      • keystroke names and their variants
        • PgDn, [PgDn], {PgDn}
        • VK_NEXT
        • PageDown, [PageDown], {PageDown}
      • timestamps (note that there are regional settings)
      • predictable Windows titles
        • [ C:\WINDOWS\system32\notepad.exe ]
        • [ C:\WINDOWS\system32\calc.exe ]
        • [http://google.com/ – Windows Internet Explorer]
        • [Google – Windows Internet Explorer]
        • [InPrivate – Windows Internet Explorer – [InPrivate]]
      • possible excluded window class names
        • msctls_progress32
        • SysTabControl32
        • SysTreeView32
      • content of the address bar
      • attractive data for attackers
        • regexes for PII (searching for names via a dictionary, states, countries, phone numbers, etc. may help)
        • anything that matches the Luhn algorithm (credit cards; see the sketch after this list)
      • input field names from web pages and related to intercepted/recognized credentials
        • user
        • username
        • password
        • pin
      • predictable user-generated content
        • internet searches
        • chats (acronyms, swearwords, smileys, etc.)
  • data exfiltration
    • who
      • username/passwords
    • how
      • ftp client (ftp.exe, far.exe, etc.)
      • browser (POSTs, more advanced: GETs)
      • DNS requests
      • USB stick
      • burnt CD
      • printer
    • when
      • just in time (frequent network connection)
      • ‘coming back’ to the system
    • configuration
      • file
      • registry
      • uses GUI (lots of good keywords!)
    • where to:
      • URLs
      • FTP server names
      • SMTP servers
      • mapped drives (\\foo\c$)
      • mapped remote paths (e.g. \\tsclient)
  • malicious code
    • any .exe/.zip in TEMP/APPLICATION DATA subfolders
    • processes that have a low edit distance between their names and known system process names (e.g. lsass.exe vs. lsas.exe; see the sketch after this list)
    • processes that use known system process names but start from a different path
    • areas of memory containing “islands” with raw addresses of APIs typically used by malware, e.g. CreateRemoteThread, WriteProcessMemory, wininet functions
  • mistakes
    • Event logs
    • AV logs/quarantine files
    • leftovers (files, etc.)
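
Two of the checks above lend themselves to short sketches: the Luhn test for credit-card-like numbers and the edit-distance comparison of process names. The threshold and the system-process list below are illustrative assumptions:

    import difflib

    def luhn_ok(digits):
        """True if a digit string passes the Luhn checksum (credit-card-like)."""
        total = 0
        for i, ch in enumerate(reversed(digits)):
            d = int(ch)
            if i % 2:                   # double every second digit from the right
                d = d * 2 - 9 if d > 4 else d * 2
            total += d
        return total % 10 == 0

    # Illustrative subset of well-known system process names
    KNOWN = ["lsass.exe", "svchost.exe", "csrss.exe", "services.exe"]

    def near_miss(name, threshold=0.85):
        """Return a known process name that 'name' suspiciously resembles."""
        for known in KNOWN:
            if name != known and \
               difflib.SequenceMatcher(None, name, known).ratio() >= threshold:
                return known
        return None

    print(luhn_ok("4111111111111111"))  # True - a well-known test card number
    print(near_miss("lsas.exe"))        # lsass.exe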

Many of these HFAs form a very manageable set that, when put together, can be applied to different data sets (file names, file paths, file content, registry settings, memory content, process dumps, etc.).

In other words – instead of chasing after sample-, family-, or hacking-group-specific stuff, we look for traces of all these things that make malware – malware, a weak system – weak, a hack – a hack, and an attack-supporting user – a victim.

Cluester – clustering as a data reduction technique

In this post I present a simple technique that can be quite helpful when you hit the wall during your analysis and don’t know what else to do. The idea relies on cluster analysis, a.k.a. clustering. Notably, the implementation of this technique is very simple, but it is hard to generalize – it depends very much on the case and data. In the example below I focus on Windows file system cluster analysis aimed mainly at discovering minor, but grouped, changes to the system – usually these are associated with a malware infection.

A typical computer forensics case deals with a tremendous amount of data, and whether we analyze it in an automated fashion or walk our way through the evidence manually, we are pretty much wasting our time until we finally find the first clue. Purists could say here that we are not really wasting our time as we are slowly ruling out the possibilities, and they are right, but I hope they will agree that such analyses are not fun. It can be quite frustrating, all the more so as we know that finding that first trace of suspicious activity is what will shape further analysis – in fact, it is what really kicks off the proper investigation (and this is also where the fun begins :)).

But finding that first piece of the puzzle is not a trivial task. For many reasons. If you are lucky, you may find malware still running in memory, the evidence has not been contaminated, or the clues are all over the place. In many cases, though, the evidence may have already been contaminated by the attackers or by ‘helpful’ IT staff that cleaned up the system before the data acquisition. There may be no malware involved at all, as the case may deal with contraband, fraud, plagiarism or other issues. Also, in many production environments there are a lot of changes introduced to the system on a regular basis (e.g. via system updates, transaction records, logs, etc.), often saved into separate files, generating a large number of small files. Analyzing such systems can be a real nightmare (e.g. hundreds of thousands of files on a single file system).

Over the years investigators have designed many techniques to sift through the data and adapted them into many useful tools and methodologies. We have got timelines, file typing, various filtering, Least Frequency of Occurrence, and many other system- or artifact-specific analyses, e.g. those of processes, the registry, prefetch files, etc.

A typical Windows system is a mess. Operating system files, applications, user profiles, data copied all over the place by users (often many of them), followed by the content of temp folders, cache directories, recycled folders, admin scripts, registry clutter, and production data (if it applies). And lots more.

How to reduce this data into something useful?

One easy way to make some sense of it is by filtering, e.g. by file typing. Files with specific extensions can be grouped together and such groups assessed one by one. This may work, but it is also very inefficient. There could be literally hundreds of thousands of files with the same file extension. There are also hundreds, often thousands, of different file extensions on the system. On top of that, a file extension is not necessarily indicative of the file content (an .exe file doesn’t need to be an executable, and pictures or executables can be stored in files with misleading extensions).

Another way to look at the file system is to realize that file system changes are logically grouped and often happen in an atomic or almost-atomic way, i.e. all at once within a short period of time – forming pretty much a sequence of ‘file system updates’. These are the clusters we will try to focus on here. These updates often happen at regular intervals and affect only specific paths. Knowing that, one can try to build an improved timeline, based not on all timestamps, but on ranges of timestamps.

First, an example of a typical timeline is shown below – one can see that I copied grep.exe to the system together with the Cygwin DLLs that are necessary to run grep.exe (as it was unable to resolve dependencies otherwise). I then ran it (as seen with the Prefetch files). Next, I copied the Sleuth Kit binaries into a tools directory and ran fls.exe, creating a list file that I used for the purpose of this article.

As you can see, the Sleuth Kit binaries form a nice cluster. The script I used to generate the listing takes into account the number of files created at the same time, and if that number is higher than 10, it pushes the file list to the right. Under normal conditions, the script wouldn’t need to print them at all, but I do it here for the sake of clarity, with a hint to the reader that such a large ‘file update’ may potentially be skipped in analysis. One can easily modify the script to either not show them at all or perhaps apply some more logic before presenting the output. Of course, completely ignoring them is a wrong idea, as only by seeing the example above can you tell that I happened to drop some files on the system. Again, depending on what you are looking for, you need to define the criteria that form your clusters and push out of sight the things that are not interesting.

One simple improvement that can be made to the timeline and clustering is timestamp normalization. By normalization I mean a way to glue more timestamps together, even if they do not match perfectly. It can be done in many ways, for example: if we assume that the window of an atomic or almost-atomic ‘file system update’ is e.g. 10 or even a whole 60 seconds, suddenly we will be able to assign more entries to a single cluster. The more entries in a cluster, the more chances for it to fall out of sight.
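
A minimal sketch of this normalization in Python (the sample entries and the 60-second window are illustrative; real input would come from your timeline):

    from collections import defaultdict
    from datetime import datetime

    # Illustrative timeline entries (timestamp, path); adapt to your own data
    entries = [
        ("2012-05-27 23:41:03", r"C:\tools\grep.exe"),
        ("2012-05-27 23:41:07", r"C:\tools\cygwin1.dll"),
        ("2012-05-27 23:41:59", r"C:\tools\cygiconv-2.dll"),
    ]

    WINDOW = 60  # assumed width of an 'almost-atomic' file system update

    clusters = defaultdict(list)
    for stamp, path in entries:
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        bucket = int(t.timestamp()) // WINDOW * WINDOW  # normalize to the window
        clusters[bucket].append((stamp, path))

    for bucket, files in sorted(clusters.items()):
        norm = datetime.fromtimestamp(bucket).strftime("%Y-%m-%d %H:%M")
        note = " (large update - could be suppressed)" if len(files) >= 10 else ""
        print(f"{norm}  {len(files)} file(s){note}")
        for stamp, path in files:
            print(f"    {stamp}  {path}")

Note that fixed buckets will still split entries that are close but straddle a window boundary (23:41:59 vs. 23:42:01); merging by the gap between consecutive entries, rather than by bucket, avoids that at the cost of a slightly more involved loop.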

In the timeline example above, merging the entries that fall within a 60-second time window makes all grep-related artifacts fall into one cluster:

The first column is the normalized timestamp (normalized to minutes, i.e. 60 seconds), then the actual timestamp, then the path.

Clustering is a very interesting technique, yet it seems not widely used. Applied on top of a timeline, it could help reduce the amount of data for manual review and, most importantly, may immediately highlight suspicious artifacts based on specific criteria. As I mentioned earlier, in the example above I decided to remove from view all the entries that form clusters of 10 or more files. This is arbitrary, but doesn’t have to be. It would certainly be helpful in finding malware, as malware rarely drops more than 10 files on the system, yet it would not be helpful in finding stashed content that has been copied in bulk. A lot more research is needed to find out how to utilize this technique widely and perhaps generate scenarios that can be converted into usable ‘dynamic smart filters’ that can be applied on top of any data.

The following things may be taken into consideration:

  • various artifacts (on top of timeline across the whole system)
  • various timestamps (one can run cluster analysis for all creation, modification, access, entry creation times)
  • various criteria can be used – across file system, or just directory, or directory and subdirectories
  • less utilized timestamps, e.g. timestamps extracted from Portable Executables – those that stand out need to be looked at (standard system binaries are usually compiled with a specific timestamp or a specific range of timestamps)
  • not only timestamps and timelines can be used; e.g. the image base of Portable Executables is also a good candidate for finding executables that stand out (a short sketch follows this list)
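
A minimal sketch of the Portable Executable idea (it assumes the third-party pefile module, installable via pip install pefile): pull the compile timestamp and image base out of every PE under a directory, so outliers can be spotted next to the usual system values:

    import os
    import sys
    from datetime import datetime, timezone

    import pefile  # third-party: pip install pefile

    def walk_pes(root):
        for dirpath, _, names in os.walk(root):
            for name in names:
                if not name.lower().endswith((".exe", ".dll", ".sys")):
                    continue
                path = os.path.join(dirpath, name)
                try:
                    pe = pefile.PE(path, fast_load=True)
                except pefile.PEFormatError:
                    continue  # not a valid PE despite the extension
                # compile timestamp and preferred image base of the binary
                stamp = datetime.fromtimestamp(
                    pe.FILE_HEADER.TimeDateStamp, tz=timezone.utc)
                print(f"{stamp:%Y-%m-%d %H:%M:%S}  "
                      f"{pe.OPTIONAL_HEADER.ImageBase:#010x}  {path}")
                pe.close()

    if __name__ == "__main__":
        walk_pes(sys.argv[1])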

The data can be manipulated and analyzed in many ways, and ideally there should be an easy way to play with clustering parameters – I am not aware of any interactive tool that could do that in a generic way, but even simple toying around with Excel or a simple script can help here. If you know of any free software that already does this, I would appreciate it if you let me know.

Last, but not least, before clustering is even applied, one can remove entries using known techniques, e.g. clean file hashes, known clean full paths, as well as data obtained from installer analysis extracted from the case (run, collect data about file paths, apply as a whitelist), and other data that has a low chance of being interesting (e.g. directories outside of the typical scope of system/user activity).
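
A minimal sketch of such pre-cluster reduction (the whitelist file names are assumptions; any source of known-clean hashes or full paths will do):

    # Assumed inputs: one known-clean full path / file hash per line
    known_paths = {line.strip() for line in open("clean_paths.txt")}
    known_hashes = {line.strip() for line in open("clean_hashes.txt")}

    def keep(path, file_hash):
        """True if the entry survives whitelisting and should be clustered."""
        return path not in known_paths and file_hash not in known_hashes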

If you want to toy around with the idea, you can download Cluester – an example script that I used for this post.

It can be downloaded from here.

It works on data obtained from running fls on an NTFS file system:

    fls -lrm <drive> -i <type> -f <type> \\.\<drive> > <output>

e.g.
    fls -lrm f: -i raw -f ntfs \\.\f:  > list
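
For reference, a minimal Python sketch that turns such a listing into (timestamp, path) pairs suitable for clustering – it assumes the TSK 3.x body format that fls -m emits (MD5|name|inode|mode|UID|GID|size|atime|mtime|ctime|crtime, with Unix epoch times):

    import sys

    def entries(body_file):
        """Yield (mtime, path) pairs from a TSK 3.x body-format listing."""
        for line in open(body_file, encoding="utf-8", errors="replace"):
            fields = line.rstrip("\n").split("|")
            if len(fields) < 11:
                continue  # skip malformed lines
            yield int(fields[8]), fields[1]  # field 9 is mtime, field 2 is name

    if __name__ == "__main__":
        for ts, path in entries(sys.argv[1]):
            print(ts, path)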

Update: 2012-05-28

Fixed typos, grammar mistakes, polished it a bit (that’s what happens when you write posts at 2am ;).