Couple of Splunk/SPL Gotchas, Part 2

It’s been nearly 5 years since I dropped this old post about Splunk Gotchas. Okay, in fairness, I also covered SPL-based path normalization in 2020, and bitmap-based hunting (aka bitmap hunting) here and here in 2024. Still… I thought it would be nice to revisit this topic, because there is so much to talk about…

Now, I will cover a few gotchas here, but before I do so, I want to state the following: I am fully aware that out of all of those presented, some are more, and some are less… known… after all, I am still an SPL n00b, so you may already know it all. I don’t claim any of it is groundbreaking in any way, but I hope it may still help someone…

Gotcha #1 – improving readability of file permissions

Let’s take a look at file permission values that we often get from logs, f.ex. from sources like OSSEC (a popular Host Intrusion Detection System/File Integrity Monitor). Its logs are a bit cryptic, but once we extract the preserved permission attributes for a given file (for simplicity, we assume 3 octal digits here that translate to sets of ‘R’, ‘W’, ‘X’ rights) into, say, oct1, oct2, oct3, we can use the below loop to convert each of these digits into an actual string representation of the rights:

| foreach oct* [
      eval <<FIELD>>=replace(<<FIELD>>,"0","---")
   |  eval <<FIELD>>=replace(<<FIELD>>,"1","--X")
   |  eval <<FIELD>>=replace(<<FIELD>>,"2","-W-")
   |  eval <<FIELD>>=replace(<<FIELD>>,"3","-WX")
   |  eval <<FIELD>>=replace(<<FIELD>>,"4","R--")
   |  eval <<FIELD>>=replace(<<FIELD>>,"5","R-X")
   |  eval <<FIELD>>=replace(<<FIELD>>,"6","RW-")
   |  eval <<FIELD>>=replace(<<FIELD>>,"7","RWX")
 ]

So, for example, for file permissions represented in octal as ‘077’, we can do this:
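Here is a minimal, self-contained sketch of the full pipeline (makeresults simulates the already-extracted fields; the final perms field is added purely for illustration):

| makeresults
| eval oct1="0", oct2="7", oct3="7"
| foreach oct* [
      eval <<FIELD>>=replace(<<FIELD>>,"0","---")
   |  eval <<FIELD>>=replace(<<FIELD>>,"1","--X")
   |  eval <<FIELD>>=replace(<<FIELD>>,"2","-W-")
   |  eval <<FIELD>>=replace(<<FIELD>>,"3","-WX")
   |  eval <<FIELD>>=replace(<<FIELD>>,"4","R--")
   |  eval <<FIELD>>=replace(<<FIELD>>,"5","R-X")
   |  eval <<FIELD>>=replace(<<FIELD>>,"6","RW-")
   |  eval <<FIELD>>=replace(<<FIELD>>,"7","RWX")
 ]
| eval perms=oct1.oct2.oct3

which renders ‘077’ as ‘---RWXRWX’.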

I find presenting *nix file attributes in such a readable way to be very beneficial to the analyst (the less we need to guess about what we see on dashboards or in alerts/reports, the better).

The foreach command is a very handy SPL construct for many situations where we have to apply the very same business logic to multiple fields – as long as their names can be addressed/referenced with a wildcard (in this case, the field names follow the incremental pattern oct1, oct2, oct3 and can be represented as the wildcard oct*).

Gotcha #2 – summary indexes

Everyone in the Splunk ecosystem knows and raves about tstats, but imho summary indexes are far sexier! They allow us to cherry-pick a small subset of ‘interesting’ events from the gazillions of all incoming events and temporarily save them to a dedicated place (yes, a summary index!), often in some normalized/efficient/data-enriched form (drop fields, do some pattern recognition and normalization, add GeoIP data, etc.), so that querying this (far smaller!) snapshot of data is not only faster, but also more focused.

For example, a dedicated summary index (or indexes) could be used to store all observed invocations of, say, ‘rundll32.exe’, ‘mshta.exe’, ‘cscript.exe’, ‘wscript.exe’, ‘powershell.exe’, ‘pwsh.exe’ (on Windows systems), or perhaps all command lines referencing ‘curl’ or ‘wget’ on Linux/Windows/macOS systems.
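As a hedged sketch, a scheduled search like the one below could feed such a summary index via the collect command (the index name, sourcetype, and field names are assumptions – adjust them to your own environment):

index=endpoint sourcetype="XmlWinEventLog:Microsoft-Windows-Sysmon/Operational" EventCode=1
    (Image="*\\rundll32.exe" OR Image="*\\mshta.exe" OR Image="*\\cscript.exe" OR Image="*\\wscript.exe" OR Image="*\\powershell.exe" OR Image="*\\pwsh.exe")
| fields _time, host, user, Image, CommandLine, ParentImage
| collect index=summary_lolbin_cmdlines

Schedule it to run, say, every 15 minutes over the last 15 minutes of data, and the summary index becomes your ready-made lolbin research dataset.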

How cool is that?

You can literally drive a lot of commandline-based research this way – discovering new invocations, new tricks, or just using it to define per-endpoint baselines… it’s all there, in your summary index, easy to digest, easier to back up, and fundamentally – a real threat hunting value amplifier.

Other uses of summary indexes include building your own, custom implementation of a dynamic asset inventory (f.ex. any new host not present in the summary index and reporting ANY event can be added). And finally, while it is not humble to say, a few years ago my friend (a much better Splunker than I am!) and I implemented a custom RBA (Risk Based Alerting) solution using summary indexes too! The data reduction and research opportunities that summary indexes offer cannot be overstated…

Gotcha #3 – counting a number of character occurrences in a string

There is an interesting problem that we sometimes come across during threat hunting – how to count the number of occurrences of a specific character in a given string. There are many uses for this simple (yet not so obvious to implement) algorithm. For instance, one may ask – given a path, how deep is the nesting of its folder structure? While we can answer that simply by counting how many forward/backward slashes appear in the path, the actual SPL implementation of this algo is not obvious, because there is no built-in function yet that does it directly (AFAIK).

Here’s the gotcha: we can get the length of the string (the path) and preserve it. We then replace all characters of interest (say: forward/backward slashes) with… nothing, and calculate this new, stripped-down string’s length. The difference between these two lengths is the count we are looking for!
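A minimal sketch of this length-difference trick (the sample path and field names are made up; note the quadruple backslash needed so that the regex inside replace() matches a literal backslash):

| makeresults
| eval path="C:\\Users\\victim\\AppData\\Local\\Temp\\payload.dll"
| eval path_len=len(path)
| eval path_no_seps=replace(path, "[\\\\/]", "")
| eval folder_depth=path_len - len(path_no_seps)

For the sample path above, folder_depth comes out as 6.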

Gotcha #4 – excluding long command line invocations

Analysis of command line invocations is HARD. We can pretend otherwise, but in reality this is one of the hardest problems of process telemetry analysis. We obviously use exclusions and focus on specific programs and patterns, but if we still want to do our due diligence, we actually need to end up clustering it all to make sure we ‘see it all’…

I find the length of logged commands to be an interesting property to play around with. If the process name is related to Java, Google Chrome, Microsoft Edge and other Chromium-based browsers, or the invocations are coming from any type of macOS/Linux shell, you will see a lot of VERY long command lines… so, if a command line is that long, we can often simply exclude it from our main searches/summaries. BUT, to avoid any misses, we can always add a compensating control that looks specifically at these excluded, long command lines. These can also often be normalized using various regular expression-based templates, f.ex. focused on GUIDs, timestamp patterns, version numbers, etc. (see my old post for more details).
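A hedged sketch of this split (the 1000-character threshold, index, sourcetype, and field names are arbitrary assumptions – pick values that match your own data). The main hunting search keeps only the short command lines:

index=endpoint sourcetype=<your_process_sourcetype>
| eval cmd_len=len(CommandLine)
| where cmd_len <= 1000
| ... (your usual hunting logic here)

while the compensating control reviews the excluded long tail separately:

index=endpoint sourcetype=<your_process_sourcetype>
| eval cmd_len=len(CommandLine)
| where cmd_len > 1000
| stats count max(cmd_len) AS max_len by Image
| sort - count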

I believe such a Divide and Conquer approach is fully justified.

Gotcha #5 – adding line breaks

SPL does not handle new lines very well, so our output can often get messy… but… we can always use this construct:

| rex field=<field_name> mode=sed "s/<something>/\n/g"

to add actual new lines to the output, f.ex.:
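For instance, a quick sketch that turns a comma-separated value into a multi-line one (the field name and sample value are made up):

| makeresults
| eval downloaders="curl,wget,certutil,bitsadmin"
| rex field=downloaders mode=sed "s/,/\n/g"

which makes long, list-like fields much easier to read in tables and dashboards.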

Gotcha #6 – URLs are… VERY complicated

The last gotcha is related to URLs. Anyone who has done SPL work for some time is familiar with the ut_parse_extended macro that is part of the URL Toolbox app.

The internet as we know it is no longer English-centric. Support for Unicode-based domain names is omnipresent, and we can’t rely on basic regexes to parse URLs anymore. The URL Toolbox takes care of all these nuances, offering URL parsing that supports non-English characters, lesser-known TLD suffixes, and IDNs.
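A quick, hedged usage sketch (the sample IDN URL is made up, and I am assuming the two-argument macro form from the URL Toolbox documentation, where the second argument names a field holding the suffix list selector – ‘*’ meaning all):

| makeresults
| eval url="https://пример.испытание/путь/страница?id=1"
| eval list="*"
| `ut_parse_extended(url,list)`

which populates the usual ut_* output fields for further analysis.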

The art of artifact collection and hoarding for the sake of forensic exclusivity… – Part 5

If you follow this series, you should know by now that I am obsessing here not about the benefits of piracy, but about a new, powerful forensic capability: a truly actionable summary (extracted from the ever-growing evidence…).

In my previous posts in this series I covered a number of different approaches one may take to analyze forensic and telemetric data obtained from an individual system or a cluster of endpoints belonging to a specific org, but here’s one more approach: wikipedia’s categorization feature.

You may or may not be aware that there are wikipedia tools available online that allow us to extract a subset of the wikipedia database that meets certain criteria: f.ex. one can select all wikipedia pages that are tagged with a certain category and export them to a file. And lo and behold – for our ‘software categorization’ purposes there is a really interesting category we should look at: Lists_of_software:

When we click ‘Add’ we will immediately populate the list of pages:

And when we click ‘Export’ we will get a relatively small XML file listing all the pages of interest…

Now, that list of pages is interesting on its own, because we get a really long list of nice categories – 484 entries (242 unique) as of today – see: wiki_pages.txt (based on the page list) and wiki_pages_unique.txt (based on the <title> entries from the exported XML file).

Secondly, when you parse this exported XML file, you will end up with a list of software names, vendor names, and domains that can now be used to… yes… categorize software we find during forensic exams! Luckily for us, most of the legitimate software packages listed on wikipedia follow some sort of naming convention scheme – this allows us to recognize them, especially when they are installed to their own, pre-programmed paths.
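As a hedged illustration of how this could plug into a Splunk-centric workflow like the one from the first half of this post (the lookup file name, its columns, and the matching field are all hypothetical – you would build the CSV yourself from the <title> entries of the exported XML):

index=forensic_triage sourcetype=<your_software_inventory_sourcetype>
| eval software_name=lower(<your_product_name_field>)
| lookup wiki_software_titles.csv software_name OUTPUT wiki_category
| eval wiki_category=coalesce(wiki_category, "unknown / not on wikipedia")
| stats count by wiki_category, software_name

Anything that falls into the ‘unknown’ bucket is a natural candidate for a closer look.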

Thirdly, when you review that exported XML file, you will quickly realize how many of these thousands of software packages you have never heard of. This is a humbling lesson for any Detection Engineering adept out there – we can’t pretend anymore that we are on top of things. Every single piece of software is a potential source of a supply-chain attack. Every single piece of software may introduce Local Privilege Escalation bugs. Every single piece of software may include new lolbins. Every single piece of software may offer functionality that can and will be abused by attackers. And this is just the software listed on wikipedia. There are gazillions of other software installs out there that have never been looked at, never been scrutinized, never been assessed from a security standpoint.

I will be exploring many of them in my future posts in this series.