
Normalizing our path to Splunk enlightenment

May 23, 2020 in Splunk, SPL

One of the most annoying things we come across while doing log analysis is the simultaneous predictability and unpredictability of file paths. Somehow… everyone really… vendors, admins, and finally users keep coming up with new ways to name the files that store their data. Combing through this mess feels like everything is an outlier.

I apply a simple strategy to get rid of the most predictable paths by using Least Frequency Occurrence (LFO), counted across hosts (using dc(host)). If a path exists on, say… at least 3 hosts, it is less likely to be malicious than one that is found on a single system only.
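Outside Splunk, the mechanics are easy to prototype. Here is a minimal Python sketch of the LFO idea – the hosts and paths are made up for illustration, and dc(host) becomes a set of hosts per path:

```python
from collections import defaultdict

# (host, path) observations -- made-up sample data
records = [
    ("hostA", r"C:\Windows\System32\svchost.exe"),
    ("hostB", r"C:\Windows\System32\svchost.exe"),
    ("hostC", r"C:\Windows\System32\svchost.exe"),
    ("hostA", r"C:\Users\John\AppData\evil.exe"),
]

# Equivalent of dc(host) by Path: distinct hosts per path
hosts_per_path = defaultdict(set)
for host, path in records:
    hosts_per_path[path].add(host)

# Keep only paths seen on fewer than 3 distinct hosts -- the outliers
outliers = [p for p, hosts in hosts_per_path.items() if len(hosts) < 3]
print(outliers)
```

The widely deployed svchost.exe path drops out; the single-host path survives as the outlier worth a closer look.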

When you start using this technique you will come across many obstacles, because lots of paths use random bits. Many of them are kind of predictable; it’s just that the patterns keep piling up. And this is where normalization helps a lot.

Normalizing data is pretty easy – we use the regex-based replace function to remove unnecessary junk. You have to start with the most precise patterns and then move towards the vague, hope-for-the-best ones.

Precise ones include:

  • GUID
  • SID
  • Date and time formats (e.g. YYYYMMDD)
  • System32|SysWOW64|sysnative paths
  • Program Files variants
  • User folders
  • Versioning information
  • Hashes or hash-lookalikes
  • etc.

One has to hope that these will reduce a lot of the noise; if that is still not the case, we can go a bit further and start using vaguer patterns, e.g.:

  • full stop followed by digits, especially at the end of directory names
  • decimal numbers at the end of the path/folder
  • multiple digits in a row
  • hexadecimal numbers
  • underscore followed by alphanumeric characters
  • etc.
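To make the precise-before-vague ordering concrete, here is a small Python sketch. The patterns and replacement tokens are illustrative stand-ins (not the exact replace() calls used in the queries below), but the principle is the same: each rule fires in order, most precise first.

```python
import re

# Ordered rules: most precise first, vaguest last (illustrative patterns)
RULES = [
    (re.compile(r"(?i)[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}"
                r"-[0-9a-f]{4}-[0-9a-f]{12}"), "<GUID>"),
    (re.compile(r"(?i)c:\\users\\[^\\]+"), "<USERDIR>"),
    (re.compile(r"\d{8}"), "<DATE>"),  # e.g. YYYYMMDD
    (re.compile(r"\d+"), "<NUM>"),     # vague catch-all: any run of digits
]

def normalize(path: str) -> str:
    for pattern, token in RULES:
        path = pattern.sub(token, path)
    return path

print(normalize(r"C:\Users\John\AppData\app_20200523\run42.log"))
# -> <USERDIR>\AppData\app_<DATE>\run<NUM>.log
```

Swap the order (e.g. put \d+ first) and the date token never fires – which is exactly why the vague rules must come last.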

It’s pretty hard to talk / learn about it without actually doing it, hence I attached a test data set for you to play with (see the bottom of this post).

The test set contains a number of fictional hosts (named after islands) and a bunch of paths that I made up, so that we can demonstrate how LFO and normalization can work in tandem. After you download the test set, you can import it into Splunk using the following name: test_dataset_paths.csv (I use it in my examples).

To confirm the data is accessible via a lookup file you can run this command in Splunk:

| inputlookup test_dataset_paths.csv

The set includes Paths from 6 hosts, belonging to 6 users:

  • John, James, Paul, Kate, Joan, Alice

John, James, and Paul installed Firefox; Kate, Joan, and Alice – Chrome. John is the only user whose system is infected.

When we run the inputlookup command we can immediately see that, even with a small number of rows, it’s not that easy to comb through:

Even if we want to do stats over it (per Path):

| inputlookup test_dataset_paths.csv
| stats values(Host) as Hosts by Path

we get this:

It’s only after we apply normalization that we get a data set which makes it easier for us to decide what to discard:

| inputlookup test_dataset_paths.csv
| eval norm_path = Path
| eval norm_path = replace(norm_path, "(?i)c:\\users\\[^\]+", "")
| stats values(Path) as Paths values(Host) as Hosts count by norm_path

Since we see repetitions of the identical normalized path across multiple systems, and some paths are clearly present on all the systems, we can remove them from the view using dc, keeping only rows where the number of hosts is below 3:

| inputlookup test_dataset_paths.csv
| eval norm_path = Path
| eval norm_path = replace(norm_path, "(?i)c:\\users\\[^\]+", "")
| stats values(Path) as Paths dc(Host) as dch values(Host) as Hosts count by norm_path
| where dch < 3

This brings us to the final result that literally shows the malware:

If you are wondering why I am using dc and not count for exclusions, it’s because your data set can include repetitions from the same host (in such a case count would be higher, while still applying to a single host); dc gives you the number of distinct hosts on which a specific Path occurred at least once.

I know it’s trivial and probably used by every splunker out there, but I remember that when I started learning SPL I was working with huge datasets from the start and it was quite overwhelming. Working with a small custom-made set makes it easier to test ideas and… regexes (especially those that require guessing how many backslashes to put in to escape characters properly).

And speaking of the devil… here is a bunch of replace function regex examples you can consider using:

  • Users on PC
    • (?i)c:\\\users\\\[^\]+
  • Users on MAC
    • (?i)^/Users/[^/]+
  • GUID
    • (?i)[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
  • Timestamp (one of many variants; it’s NOT precise, but good enough for day to day work)
    • (?i)20[12][0-9][01][0-9][0-3][0-9][0-9]+\.[0-9]+
  • Timestamp using actual names of months
    • (?i)\d+(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?|aug(ust)?|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?)\d+
  • SID
    • (?i)S-.-.-..-[0-9]+-[0-9]+-[0-9]+-[0-9]+
  • Options (command line)
    • (?i) -+[a-z0-9-_]+=[^ ]+
  • Random directory with a tilde followed by digits
    • (?i)~\d+\\\\

They may be buggy (both my logic and the formatting of this blog may affect them), so treat them with caution and always check against your data set. They are mainly meant to show a couple of ideas that can help you get started.
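One cheap way to check the logic (though not the Splunk escaping) is to run the patterns through Python’s re module first. Here are the GUID and month-name timestamp ones from the list above – note that the backslash counts in Splunk strings will differ from what Python/PCRE expects:

```python
import re

# GUID pattern from the list above (standard PCRE syntax works in Python)
GUID = re.compile(r"(?i)[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}"
                  r"-[0-9a-f]{4}-[0-9a-f]{12}")

# Month-name timestamp pattern from the list above
MONTH_TS = re.compile(
    r"(?i)\d+(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?"
    r"|aug(ust)?|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?)\d+")

print(GUID.search("setup_{a1b2c3d4-0000-1111-2222-333344445555}.log").group(0))
print(MONTH_TS.search("backup_01jan2019.zip").group(0))
```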

Using LFO and normalization, together with a few other tricks (e.g. filtering by additional fields, counting per host or directory, bucketing, adding weights for risk-based scoring), leads us to a very favorable outcome: we stop using inclusion/exclusion lists, i.e. we stop using signatures.

The data set is here.

Couple of Splunk/SPL Gotchas

July 26, 2019 in Splunk, SPL

Splunk is one of the coolest things that have ever happened to the blue team world. Seriously… everyone either used it, is using it, or plans on using it. I know there are alternatives, but this post is about Splunk only.

In the interest of full disclosure: when I started playing with the Search Processing Language (SPL), I did so with a very conservative mindset. I mean a SQL-based (MySQL, MS SQL, even MS Access, etc.) slash very GNU-tools-oriented (DFIR/grep/perl sessions et al.) background. Because of that, at first, SPL looked to me like this kinda new, weird, hybrid, pretentious, hipsterish wannabe language that tries to marry these two remote worlds together in a very, hmm, counter-intuitive way…

Still… from a DFIR perspective I quickly embraced SPL as a great way to comb through a large data set w/o using the lengthy, and actually quite limited GNU commands.

I got ‘converted’.

I also managed to progress a bit farther – primarily because I was privileged to work with people who are very solid SPL query writers. I picked up a lot of knowledge from them – not only the basics, but actually a totally different mindset for analyzing data. For that I am really grateful. And while I am still learning, I thought I’d jot down some notes that may make somebody’s life easier, especially if they are only starting now…

You see… SPL comes with a lot of gotchas. If you are a hardcore SPL dev you kinda forget about them easily, because you know, you remember, you absorbed it many years ago and it’s now in your blood. But… if you are a noob like me, it really takes a lot of debugging cycles to figure this stuff out.

So, without further ado, the below is dedicated to all wannabe SPL blue devs…

Okay, one more thing I need to mention: the paragraphs below are intentionally NOT data-centric. While I could provide some examples of issues I encountered, I felt it’s better to just describe things the way I found them or the way I understand them. I may not be right about all of them (read: noob), and I also feel that reading it like a story may help readers approach SPL with a ‘fresh’ mind.

The first gotcha, in my opinion, is the SPL documentation. It’s very rich and very detailed, but IMHO it’s written for people who design compilers, not those who simply want to use the SPL query language on a daily basis. I actually feel that perl documentation is more user-friendly. In my noobish eyes the documentation for SPL is really hard to digest. I don’t know if it’s just me, but after reading a paragraph or two I often get disinterested and end up either asking around, or Stackoverflowing. Shame, I know.

The second gotcha is… keep a list of all your queries. The first ones are truly hard, wrong, and mundane to develop, and sometimes even impossible to understand – especially after a few days away from them. Still, once you get into a ‘query hamstering’ habit, copypasta will be your best friend in the future.

After 2-3 months of storing every possible query you come across in your org or during your work, whether sourced internally or on Stack Overflow, you will build a corpus of queries that you can instantly use to develop… new queries. With a good solid base of ‘historical’ queries most of my new creations are almost exclusively copypastas & variations.

They are just tweaked and shuffled around according to the need du jour. The only ‘new’ stuff I have to face is when I go analysing new log sources. And anytime I do, I jot down every single query, even a lame one. Duplicates are fine, badly written queries are fine. Everything is fine. Just keep it, enrich it, re-purpose in the future. I can’t count how often my ‘snippets’ file saved me when I had to quickly put together a query looking for a specific set of artifacts. Once it’s there, it’s in your arsenal.

The next gotcha is… data field names. Most fields are badly or inconsistently named, or not extracted properly. At first, I wondered how Splunk Admins allow stuff like this to even enter their empire. It sounds reasonable to address it at the time logs make it into Splunk (otherwise, it will follow the infamous garbage in, garbage out paradigm). So, the gotcha is that more often than not the Splunk data that comes in is not properly extracted and is not production-ready. To be really fair though, this is not the fault of Splunk admins – they struggle no less than data users do – they actually need to run the whole thing 24/7/365, satisfy compliance reqs, and then support, support, support. Some random poor-data-quality issue is probably the least important item on their daily priority list. Establish a good rapport with these folks, and walk with them through these painful items one by one.

The last point is actually very interesting – there are so many things that can go wrong. Even original technology add-ons (TAs) developed by vendors often… don’t extract data properly! Splunk admins are reluctant to change these, and rightly so (things may break on a global scale), so in the end it forces us to extract some specific data with dedicated regexes (which of course slows our queries down!).

Then there are other quirks… The same thing is named differently across different indexes – I actually covered a lot of this topic in my older post about log quality, so I won’t get into too much detail here.

But there is still more.

For example, if your field names include ‘.’, spaces, or any other character that can be interpreted in multiple ways, please do not use them. If you can, ask your Splunk gurus to tweak the field names at the source to avoid these special characters, or rename them following a strict naming convention.


Because e.g. ‘.’ acts like a ‘+’ (string concatenation) in eval expressions (at least there, but maybe in others too). If you fail to notice, or forget to quote full-stop-separated field names, you may end up thinking that your eval is working on that specific field value, while in reality you are just concatenating two parts of a field name together! This leads to unexpected results.

Another gotcha is a non-existing field. If your query assumes that a field always exists, you will most likely write a wrong query. Or, more precisely, the query is all right, but the result depends on the presence of that particular field. It’s a very common mistake to assume that all events deliver the same field set.

They actually often don’t.

Not only do many templates for the same Event Code exist, we also have to deal with lack of parsing, localization (field names change depending on the OS language version!), updates to queries and… the data itself! Very often you will end up extracting fields manually and using functions like coalesce. And coalesce is NOT always the best option. Often, you will want to use the if(isnull(… construct. And even better, in some cases you can check whether the value is NULL or an empty string.
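To see why coalesce alone can bite, here is a Python model – coalesce() mirrors SPL’s first-non-null behavior, while first_usable() is a hypothetical helper modeling the if(isnull(x) OR x="", …) approach:

```python
# coalesce: first non-null value, like SPL's coalesce() -- note that an
# empty string is NOT null, so it wins over a real value further down
def coalesce(*values):
    for v in values:
        if v is not None:
            return v
    return None

# first_usable: hypothetical helper that also rejects empty strings,
# modeling the if(isnull(x) OR x="", ...) pattern
def first_usable(*values):
    for v in values:
        if v is not None and v != "":
            return v
    return None

print(coalesce(None, "", "john"))      # -> "" (probably not what you wanted)
print(first_usable(None, "", "john"))  # -> "john"
```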

Some fields are pre-extracted, and some may not be. And to add insult to injury – field extraction may change overnight. Literally. Your queries will start failing, because they expect a set of data that is no longer there. So… always debug your SPL to ensure there is no chance for such conditions to break your query. I use fillnull almost religiously in most of my queries to detect cases like this.

In other words – when you build a production query, build one or more additional queries that keep an eye on the quality of the input data. Is it even there? Does it come in the right volumes, always on schedule, in the right format? Is it parsed properly?

The SPL quirks often get noticed when one realizes the difference between NOT (field=value) and field!=value. The NOT form is safer: field!= silently drops events where the field doesn’t exist, while NOT still includes them. This is to be followed religiously if you want proper (read: fully inclusive) results.
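A small Python model (not Splunk itself, just the logic) makes the difference visible – the field names and events here are made up:

```python
# Three events; one is missing the field entirely
events = [
    {"user": "alice"},
    {"user": "bob"},
    {},  # no 'user' field at all
]

# field!=value: only matches events where the field EXISTS and differs
neq = [e for e in events if "user" in e and e["user"] != "bob"]

# NOT (field=value): also keeps events where the field is absent
not_eq = [e for e in events if not ("user" in e and e["user"] == "bob")]

print(len(neq), len(not_eq))  # 1 vs 2 -- NOT keeps the field-less event
```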

Another interesting bit touches on unusual query results. One of the most intriguing quirks is a perceived ‘shifting’ of the content/columns of a table. The way it works is that you expect data in a certain column, and it’s… not there. Instead, it may appear in a column shifted one, two, or more places to the left or right. Moreover, if you swap the field-name sequence in the table syntax, you may notice that columns that caused problems before now display proper results!

What is happening?

The underlying issue is often trivial: typically it is a side-effect of accidentally duplicating field names in your table syntax, e.g. foo in the below syntax occurs twice:

| table _time, foo, bar, –>foo<–, data, result

If one field is listed twice, only its first instance will be shown as a distinct column header, but the data rows will still include the extra column with the duplicated data! Hence the visual shift.

A completely different set of gotchas comes with lookup tables. Populating them with data that is shown directly as results inside the browser GUI is… a bad idea. It works most of the time. Until it doesn’t. Values with multiple spaces shown as a result of an SPL query will always be truncated/trimmed. If you copypaste them directly into a Lookup table, these entries will be binary-different from the actual data you want to include inside the lookup table. So… the best way to tackle it is to export the results of SPL queries to CSV first, open them in Excel, and copy these to the Lookup Editor, or even better – import them, replacing the old CSV with a new one! Also, in the latest Lookup Editor the editing logic changed and I found it doesn’t work well on the latest version of Chrome – copying values directly from Excel to the Lookup Editor often breaks the data added, by introducing extra new lines in the most unexpected places. The Export -> Edit -> Import option is the only one that works flawlessly.

Finally, the last bit in today’s post: multivalue fields & search criteria optimization.

The best example to look at is the Account_Name field that is present in most Windows Event Logs. Many of them include two Account_Name fields. When Splunk ingests this data, it stores all the values of identically named fields inside a multivalue field. Hence you have Account_Name[0] and Account_Name[1], which can be retrieved via e.g. mvindex.

The (positive) interesting thing about a multivalue field is that you can still use it for direct comparisons! Yes, things like Account_Name=foobar will work.

When such a comparison is made it is in fact a comparison of Account_Name[0]=foobar, and if Account_Name is a multivalue field it will iterate through all the values, e.g. Account_Name[1]=foobar, Account_Name[2]=foobar, etc.

So, while counter-intuitive, comparisons against MV fields are actually a good, working (inclusive!) approach.
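In Python terms the MV comparison behaves like an any() over the values (the list and names here are made up for illustration):

```python
# A plain list standing in for a multivalue Account_Name field
account_name = ["SYSTEM", "foobar"]

# Splunk's Account_Name=foobar matches if ANY of the values matches
matches = any(value == "foobar" for value in account_name)
print(matches)  # True -- the second value matches
```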

And accessing the values one by one (splitting them into separate rows)? The way to deal with them is to use mvexpand, e.g.:

| mvexpand Account_Name

So… for any MV field, apply the exclusions/inclusions in the main SPL query, then mvexpand, then apply the additional logic that will work on the event split into multiple rows. It’s a bit of a (duplicated-condition) pain, but this guarantees you won’t miss any events.
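For intuition, mvexpand behaves like this Python sketch – one output row per value of the multivalue field, with all other fields duplicated (event contents made up):

```python
# An event whose Account_Name is multivalue (modeled as a list)
event = {"_time": "2019-07-26 10:00:00", "Account_Name": ["SYSTEM", "john"]}

# mvexpand Account_Name: duplicate the event once per value
expanded = [{**event, "Account_Name": value}
            for value in event["Account_Name"]]

print(len(expanded))                # 2 rows
print(expanded[0]["Account_Name"])  # SYSTEM
```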

The last example touches big time on the topic of search optimization. The rule is simple: anything that you can do before the first pipe is a BIG win. Even if you need to narrow down the ‘pre-piped’ results further, it’s still usually very beneficial. It helps to narrow down the data that will be post-processed by more performance-heavy analysis/data crunching (e.g. regex). In other words, even if a regex can do a more refined selection, always start with direct searches on the _raw field!

Last, but not least. I do not claim that everything stated above is true. I am still a noob SPL user and would welcome any clarification from anyone who spent more time with Splunk and SPL. Thanks in advance!

And totally last few bits of actual SPL goodness:

List all the macros:

| rest /servicesNS/-/-/admin/macros count=0

List all the fired alerts:

| rest /services/alerts/fired_alerts splunk_server=local
| table eai:acl.owner id title triggered_alert_count