Normalizing our path to Splunk enlightenment

One of the most annoying things we come across while doing log analysis is both the predictability and the unpredictability of file paths. Somehow everyone (vendors, admins, and finally users) keeps coming up with new ways to name the files that store their data. Combing through this mess, everything feels like an outlier.

I apply a simple strategy to get rid of the most predictable paths: Least Frequency of Occurrence (LFO), counted across hosts (using dc(host)). If a path exists on, say, at least 3 hosts, then it is less likely to be malicious than one that is found on a single system only.

When you start using this technique you will run into many obstacles, because lots of paths contain random bits. Many of them are kinda predictable; it’s just that the patterns keep piling up. And this is where normalization helps a lot.

Normalizing data is pretty easy – we use the regex-based replace function to remove unnecessary junk. You have to start with the most precise patterns and then move towards the vague, hope-for-the-best ones.

Precise ones include:

  • GUID
  • SID
  • Date and time formats (e.g. YYYYMMDD)
  • System32|SysWOW64|sysnative paths
  • Program Files variants
  • User folders
  • Versioning information
  • Hashes or hash-lookalikes
  • etc.
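A few of the precise patterns above can be chained as consecutive replace calls. This is only a sketch under assumptions: the Path and Host field names come from the test lookup used later in this post, the placeholder tokens (<GUID>, <SID>, <SYSDIR>, <PROGRAMFILES>) are my own illustrative choice (replacing with an empty string works as well), and the SID pattern only covers the common S-1-5-21 user form. The patterns deliberately avoid backslashes, so no escaping guesswork is needed; the price is that they are unanchored and can also match inside file names.

```spl
| inputlookup test_dataset_paths.csv
| eval norm_path = Path
| eval norm_path = replace(norm_path, "(?i)[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", "<GUID>")
| eval norm_path = replace(norm_path, "(?i)S-1-5-21-[0-9]+-[0-9]+-[0-9]+-[0-9]+", "<SID>")
| eval norm_path = replace(norm_path, "(?i)(system32|syswow64|sysnative)", "<SYSDIR>")
| eval norm_path = replace(norm_path, "(?i)program files( [(]x86[)])?", "<PROGRAMFILES>")
| stats dc(Host) as dch values(Host) as Hosts by norm_path
```

Keeping a placeholder instead of deleting the match makes it easier to see afterwards which rule fired on a given path.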

One has to hope that these will reduce a lot of noise, and if it is still not the case we can go a bit further and start using more vague patterns e.g.:

  • full stop followed by digits, especially at the end of directory names
  • decimal numbers at the end of the path/folder
  • multiple digits in a row
  • hexadecimal numbers
  • underscore followed by alphanumeric characters or digits
  • etc.
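The vaguer patterns can be sketched the same way, applied after the precise ones. Again, this is an assumption-heavy illustration rather than a recommended rule set: the placeholder tokens and the length thresholds (8 or more hex characters, 2 or more digits) are arbitrary, and patterns this loose will also hit legitimate names (for example a ".7z" extension, or any ordinary word following an underscore), so check the output against your own data.

```spl
| inputlookup test_dataset_paths.csv
| eval norm_path = Path
| eval norm_path = replace(norm_path, "[.][0-9]+", ".<N>")
| eval norm_path = replace(norm_path, "(?i)_[a-z0-9]+", "_<ID>")
| eval norm_path = replace(norm_path, "(?i)[0-9a-f]{8,}", "<HEX>")
| eval norm_path = replace(norm_path, "[0-9]{2,}", "<N>")
| stats dc(Path) as raw_paths dc(Host) as dch values(Host) as Hosts by norm_path
```

Here raw_paths shows how many distinct original paths collapsed into each normalized one, which is a quick way to spot a rule that is too greedy.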

It’s pretty hard to talk or learn about this without actually doing it, hence I attached a test data set for you to play with (see the bottom of this post).

The test set contains a number of fictional hosts (named after islands) and a bunch of paths that I made up so that we can demonstrate how the LFO and normalization can work in tandem. After you download the test set, you can import it to Splunk using the following name: test_dataset_paths.csv (I use it in my examples).

To confirm the data is accessible via a lookup file, you can run this command in Splunk:

| inputlookup test_dataset_paths.csv

The set includes Paths from 6 hosts:


belonging to 6 users:

  • John, James, Paul, Kate, Joan, Alice

John, James, and Paul installed Firefox, and Kate, Joan and Alice – Chrome. John is the only user who has his system infected.

When we run the inputlookup command we can immediately see that even with a small number of rows it’s not that easy to comb through it:

Even if we want to do stats over it (per Path):

| inputlookup test_dataset_paths.csv
| stats values(Host) as Hosts by Path

we get this:

It’s only after we apply normalization that we get a data set that makes it easier for us to decide what to discard:

| inputlookup test_dataset_paths.csv
| eval norm_path = Path
| eval norm_path = replace(norm_path, "(?i)c:\\\\users\\\\[^\\\\]+", "")
| stats values(Path) as Paths values(Host) as Hosts count by norm_path

Since we see identical normalized paths repeated across multiple systems, and some paths are clearly present on all the systems, we can remove them from the view using dc, keeping only the paths seen on fewer than 3 hosts:

| inputlookup test_dataset_paths.csv
| eval norm_path = Path
| eval norm_path = replace(norm_path, "(?i)c:\\\\users\\\\[^\\\\]+", "")
| stats values(Path) as Paths dc(Host) as dch values(Host) as Hosts count by norm_path
| where dch < 3

This brings us to the final result that literally shows the malware:

If you are wondering why I am using dc and not count for exclusions, it’s because your data set can include repetitions from the same host (in such a case count would be higher, while still applying to a single host); dc gives you the number of distinct hosts on which a specific Path occurred at least once.
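To see the difference between count and dc on your own data, a query along these lines may help. It is a sketch that assumes the same Path and Host fields and the same threshold of 3, and it writes the user-folder regex with four backslashes per literal backslash, which is my understanding of how eval strings need to be escaped. It lists normalized paths that a count-based filter would have discarded even though they appear on fewer than 3 hosts. With the small test set it may well return nothing, since that depends on whether any host repeats a path.

```spl
| inputlookup test_dataset_paths.csv
| eval norm_path = replace(Path, "(?i)c:\\\\users\\\\[^\\\\]+", "")
| stats count dc(Host) as dch values(Host) as Hosts by norm_path
| where count >= 3 AND dch < 3
```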

I know it’s trivial and probably used by every splunker out there, but I remember that when I started learning SPL I was working with huge datasets from the start, and it was quite overwhelming. Working with a small custom-made set makes it easier to test ideas and… regexes (especially those that require guessing how many backslashes to put to escape characters properly).

And speaking of the devil… here is a bunch of replace function regex examples you can consider using:

  • Users on PC
    • (?i)c:\\\\users\\\\[^\\\\]+
  • Users on Mac
    • (?i)^/Users/[^/]+
  • GUID
    • (?i)[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
  • Timestamp (one of many variants; it’s NOT precise, but good enough for day to day work)
    • (?i)20[12][0-9][01][0-9][0-3][0-9][0-9]+\.[0-9]+
  • Timestamp using actual names of months
    • (?i)\d+(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?|aug(ust)?|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?)\d+
  • SID
    • (?i)S-.-.-..-[0-9]+-[0-9]+-[0-9]+-[0-9]+
  • Options (command line)
    • (?i) -+[a-z0-9_-]+=[^ ]+
  • Random directory with a tilde followed by digits
    • (?i)~\d+\\\\

They may be buggy (both the logic and the formatting of this blog may affect them), so treat them with caution and always check against your own data set. They are mainly here to show a couple of ideas that can help you get started.

Using LFO and normalization, together with a few other tricks (e.g. filtering by additional fields, counting per host or per directory, bucketing, adding weights for risk-based scoring), leads us to a very favorable outcome: we stop using inclusion/exclusion lists, i.e. we stop using signatures.
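As one possible illustration of adding weights, the host count can be turned into a crude score instead of a hard cut-off. Everything specific here is hypothetical: the risk field name, the weight values and the list of extensions are arbitrary choices for this sketch, not a tested scoring model, and the user-folder regex assumes the four-backslash escaping for eval strings.

```spl
| inputlookup test_dataset_paths.csv
| eval norm_path = replace(Path, "(?i)c:\\\\users\\\\[^\\\\]+", "")
| stats dc(Host) as dch values(Host) as Hosts by norm_path
| eval risk = case(dch == 1, 3, dch == 2, 1, true(), 0)
| eval risk = risk + if(match(norm_path, "(?i)[.](exe|dll|ps1)$"), 2, 0)
| where risk > 0
| sort - risk
```

A score like this keeps the rare-but-not-unique paths visible at the bottom of the list rather than dropping them outright.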

The data set is here.