Couple of Splunk/SPL Gotchas

Splunk is one of the coolest things that have ever happened to the blue team world. Seriously… everyone has either used it, is using it, or plans on using it. I know there are alternatives, but this post is about Splunk only.

In the interest of full disclosure: when I started playing with the Search Processing Language (SPL), I did so with a very conservative mindset. I came from an SQL-based (MySQL, MS SQL, even MS Access, etc.) SLASH very GNU tools-oriented (DFIR/grep/perl sessions et al.) background. Because of that, at first SPL looked to me like some kind of new, weird, hybrid, pretentious, hipsterish wannabe language that tries to marry these two distant worlds together in a very, hmm, counter-intuitive way…

Still… from a DFIR perspective I quickly embraced SPL as a great way to comb through a large data set w/o using the lengthy, and actually quite limited, GNU commands.

I got ‘converted’.

I also managed to progress a bit further – primarily because I was privileged to work with people who are very solid SPL query writers. I picked up a lot of knowledge from them, and not only the basics, but actually a totally different mindset for analyzing data. For that I am really grateful. And while I am still learning, I thought I would jot down some notes that may make somebody’s life easier, especially if they are only starting now…

You see… SPL comes with a lot of gotchas. If you are a hardcore SPL dev you kinda forget about them easily, because you know them, you remember them, you absorbed them many years ago and they are now in your blood. But… if you are a noob like me, it really takes a lot of debugging cycles to figure this stuff out.

So, without further ado, the below is dedicated to all wannabe SPL blue devs…

Okay, one more thing I need to mention: the paragraphs below are intentionally NOT data-centric. While I could provide some examples of issues I encountered, I felt it’s better to just describe things the way I found them, or the way I understand them. I may not be right on all of them (read: noob), but I also feel that reading it like a story may help readers approach SPL with a ‘fresh’ mind.

The first gotcha, in my opinion, is the SPL documentation. It’s very rich, it’s very detailed, but IMHO it’s written for people who design compilers, not those who simply want to use the SPL query language on a daily basis. I actually feel that perl documentation is more user-friendly. To my noobish eyes the SPL documentation is really hard to digest. I don’t know if it’s just me, but after reading a paragraph or two I often get disinterested and end up either asking around, or Stackoverflowing. Shame, I know.

The second gotcha is… keep a list of all your queries. The first ones are truly hard and mundane to develop, often wrong, and sometimes impossible to understand – especially after a few days away from them. Still, once you get into a ‘query hamstering’ habit, copypasta will be your best friend in the future.

After 2-3 months of storing every possible query you come across in your org or during your work, whether sourced internally or on Stack Overflow, you will build a corpus of queries that you can instantly use to develop… new queries. With a good, solid base of ‘historical’ queries, most of my new creations are almost exclusively copypastas & variations.

They are just tweaked and shuffled around according to the need du jour. The only ‘new’ stuff I have to face is when I go analysing new log sources. And anytime I do, I jot down every single query, even a lame one. Duplicates are fine, badly written queries are fine. Everything is fine. Just keep it, enrich it, re-purpose it in the future. I can’t count how often my ‘snippets’ file saved me when I had to quickly put together a query looking for a specific set of artifacts. Once it’s there, it’s in your arsenal.

The next gotcha is… data field names. Most fields are badly or inconsistently named, or not extracted properly. At first, I wondered how Splunk Admins allow stuff like this to even enter their empire. It sounds reasonable to address it at the time the logs make it into Splunk (otherwise, it will follow the infamous garbage in, garbage out paradigm). So, the gotcha is that more often than not the Splunk data that comes in is not properly extracted and is not production-ready. To be really fair though, this is not the fault of Splunk admins – they struggle no less than the data users do – they actually need to run the whole thing 24/7/365, satisfy compliance reqs, and then support, support, support. Some random poor data quality issue is probably the least important item on their daily priorities list. Establish a good rapport with these guys, and walk with them through these painful items one by one.

This last point is actually very interesting – there are so many things that can go wrong. Even original technology add-ons (TAs) developed by vendors often… don’t extract data properly! Splunk admins are reluctant to change these, and rightly so (things may break on a global scale), so in the end it forces us to extract some specific data with dedicated regexes (which, of course, slows our queries down!).
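For illustration, a minimal sketch of such an inline extraction with rex – the index, the Message field and the regex below are made-up placeholders, not a recipe for any particular TA:

index=wineventlog EventCode=4688
| rex field=Message "New Process Name:\s+(?<new_process>.+)"
| stats count by new_process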

Then there are other quirks… The same thing is named differently across different indexes – I actually covered a lot on this topic in my older post about log quality, so I won’t be getting into too much detail here.

But there is still more.

For example, if your field names include ‘.’, spaces, or any other character that can be interpreted in multiple ways, please do not use them. If you can, ask your Splunk gurus to tweak field names at the source to avoid these special characters. Or rename them following a strict naming convention.
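A quick sketch of the rename approach (the field name is a hypothetical example; note the double quotes around the special-character name):

| rename "user.name" AS user_name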

Why?

Because e.g. ‘.’ acts as the string concatenation operator in eval expressions (at least there, but maybe in other commands too). If you fail to notice, or forget to single-quote full stop-separated field names, you may end up thinking that your eval is working on that specific field’s value, while in reality you are just concatenating the values of two fields whose names happen to be the two halves of the name! This leads to unexpected results.
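A minimal repro with makeresults (nothing environment-specific here). In eval, single quotes refer to a field name, double quotes to a string literal (or to a special-character field name on the left-hand side of an assignment):

| makeresults
| eval foo="A", bar="B", "foo.bar"="C"
| eval wrong=foo.bar
| eval right='foo.bar'

Here wrong comes back as “AB” (foo concatenated with bar), while right correctly reads “C” from the field literally named foo.bar.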

Another gotcha is a non-existent field. If your query assumes that a field always exists, you will most likely write a wrong query. Or, more precisely, the query is all right, but the result is dependent on the presence of that particular field. It’s a very common mistake to assume that all events deliver the same field set.

They actually often don’t.

Not only do many templates for the same Event Code exist, we also have to deal with a lack of parsing, localization (field names change depending on the OS language version!), updates to queries and… the data itself! Very often you will end up extracting fields manually and using functions like coalesce. And coalesce is NOT always the best option. Often, you will want to use the if(isnull( construct. And even better, in some cases you can check both whether the value is NULL and whether it is an empty string.
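A quick sketch of both approaches (the field and fallback names are hypothetical):

| eval user=coalesce(Account_Name, User_Name, "unknown")
| eval user=if(isnull(Account_Name) OR Account_Name=="", "unknown", Account_Name)

The second form also catches the empty-string case that coalesce happily passes through.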

Some of the fields are pre-extracted, and some may not be. And to add insult to injury – field extraction may change overnight. Literally. Your queries will start failing, because they expect a set of data that is no longer there. So… always debug your SPL to ensure there is no chance for such conditions to break your query. I use fillnull almost religiously in most of my queries to detect cases like this.
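For example (index and field names are placeholders), a fillnull before the stats makes missing extractions show up as an explicit bucket instead of silently vanishing from the results:

index=wineventlog
| fillnull value="MISSING" Account_Name Logon_Type
| stats count by Account_Name, Logon_Type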

In other words – when you build a production query, build one or more additional queries that keep an eye on the quality of the input data. Is the data even there? Does it come in the right volumes, always on schedule, in the right format? Is it parsed properly?
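A trivial sketch of such a watchdog (the index name is a placeholder) – eyeballing the hourly volume per sourcetype quickly exposes feeds that go quiet or suddenly change shape:

index=your_index
| timechart span=1h count by sourcetype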

The SPL quirks often get noticed when one realizes the difference between NOT (field=value) and field!=value. The NOT form is safer: field!=value matches only events where the field exists AND holds a different value, while NOT field=value also matches events where the field is missing entirely – it doesn’t silently drop them! This is to be followed religiously if you want proper (read: fully inclusive) results.
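Side by side (index and value are placeholders):

index=wineventlog NOT Account_Name="SYSTEM"

index=wineventlog Account_Name!="SYSTEM"

The first returns events where Account_Name differs from SYSTEM plus events with no Account_Name field at all; the second quietly excludes events where the field was never extracted.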

Another interesting bit touches on unusual query results. One of the most intriguing quirks is a perceived ‘shifting’ of the content/columns of a table. The way it works is that you expect data in a certain column, and it’s… not there. Instead, it may appear in a column shifted one, two or more places to the left or right. More so, if you swap the field name sequence in the table syntax, you may notice that columns that caused problems before now display proper results!

What is happening?

The underlying issue is often trivial: typically it is a side-effect of accidentally duplicating field names in your table syntax, e.g. foo in the syntax below occurs twice:

| table _time, foo, bar, -->foo<--, data, result

If one field is listed twice, only its first instance will be shown as a distinct column header, but the data rows will still include the extra column with the duplicated data! Hence the visual shift.
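A self-contained repro with makeresults (no real data needed) – run it and watch the columns drift:

| makeresults
| eval foo="F", bar="B", data="D", result="R"
| table _time, foo, bar, foo, data, result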

A completely different set of gotchas comes with the lookup tables. Populating them with data that is shown directly as results inside the GUI of the browser is… a bad idea. It works most of the time. Until it doesn’t. Values with multiple spaces shown as a result of an SPL query will always be truncated/trimmed. If you copypaste them directly into a lookup table, these entries will be binary-different from the actual data you want to include inside the lookup table.

So… the best way to tackle it is to export the results of SPL queries to CSV first, open them in Excel, and copy these into the Lookup Editor, or even better – import them, replacing the old CSV with a new one! Also, in the latest Lookup Editor the editing logic changed, and I found it doesn’t work well with the latest version of Chrome – copying values directly from Excel to the Lookup Editor often breaks the added data by introducing extra new lines in the most unexpected places. The Export -> Edit -> Import option is the only one that works flawlessly.
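An alternative that skips the clipboard entirely: write the query results straight into the lookup with outputlookup (the index and lookup file name below are made-up examples):

index=proxy
| stats count by url
| outputlookup suspicious_urls.csv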

Finally, the last bit in today’s post: multivalue fields & search criteria optimization.

The best example to look at is the Account_Name field that is present in most of the Windows Event Logs. Many of them include two Account_Name fields. When Splunk ingests this data, it stores all the values of identically named fields inside a multivalue field. Hence you have Account_Name[0] and Account_Name[1] that can be retrieved via e.g. mvindex.
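For instance, to pull the two values apart (the subject/target ordering shown here is the usual one for Windows Security events, but verify it against your own data):

index=wineventlog EventCode=4624
| eval subject_account=mvindex(Account_Name, 0), target_account=mvindex(Account_Name, 1)
| table _time, subject_account, target_account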

The (positively) interesting thing about multivalue fields is that you can still use them for direct comparisons! Yes, things like:

Account_Name=foobar

will work.

When such a comparison is made, it is in fact a comparison of Account_Name[0]=foobar, and if Account_Name is a multivalue field, it will iterate through all the values, e.g. Account_Name[1]=foobar, Account_Name[2]=foobar, etc.

So, while counter-intuitive, comparisons against MV fields are actually a good, working (inclusive!) approach.

And what about accessing the values one by one (splitting them into separate rows)? The way to deal with that is to use mvexpand, e.g.:

| mvexpand Account_Name

So… for any MV field, use the exclusions/inclusions in the main SPL query, then mvexpand, then apply the additional logic that will work on the event split into multiple rows, as in the sketch below. It’s a bit of a (duplicated condition) pain, but this guarantees you won’t miss any events.
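A sketch of the full pattern (index, EventCode and the account pattern are placeholders) – note the condition appearing twice, once inclusively against the MV field in the base search and once per-row after the expand:

index=wineventlog EventCode=4624 Account_Name="*svc*"
| mvexpand Account_Name
| search Account_Name="*svc*"
| stats count by Account_Name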

This last example touches big time on the topic of search optimization. The rule is simple: anything that you can do before the first pipe is your BIG win. Even if you later need to narrow down the ‘pre-piped’ results further, it’s still usually very beneficial. It helps narrow down the data that will be post-processed by the more performance-heavy analysis/data crunching (e.g. regex). In other words, even if a regex can do a more refined selection, always start with direct searches on the _raw field!
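For example (the index, the regex and the field names are all hypothetical): a cheap literal term before the first pipe lets the index discard most events before the expensive rex ever runs:

index=proxy "evil.example.com"
| rex field=_raw "cs_host=(?<cs_host>\S+)"
| search cs_host="evil.example.com"

The literal string in the base search does the heavy lifting; the regex then only refines the survivors.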

Last, but not least: I do not claim that everything stated above is true. I am still a noob SPL user and would welcome any clarification from anyone who has spent more time with Splunk and SPL. Thanks in advance!

And the totally last few bits of actual SPL goodness:

List all the macros:

| rest /servicesNS/-/-/admin/macros count=0

List all the fired alerts:

| rest /services/alerts/fired_alerts splunk_server=local
| table eai:acl.owner eai:acl.app id title triggered_alert_count