{"id":7206,"date":"2020-05-23T12:41:46","date_gmt":"2020-05-23T12:41:46","guid":{"rendered":"http:\/\/www.hexacorn.com\/blog\/?p=7206"},"modified":"2020-05-24T16:43:30","modified_gmt":"2020-05-24T16:43:30","slug":"normalizing-our-path-to-splunk-enlightenment","status":"publish","type":"post","link":"https:\/\/www.hexacorn.com\/blog\/2020\/05\/23\/normalizing-our-path-to-splunk-enlightenment\/","title":{"rendered":"Normalizing our path to Splunk enlightenment"},"content":{"rendered":"\n<p>One of the most annoying bits that we come across while doing log analysis is both the predictability and unpredictability of file paths. Somehow&#8230;. everyone really&#8230; vendors, admins, and finally users keep coming up with new ways to name files that store their data. Combing through this mess feels like everything is an outlier.<\/p>\n\n\n\n<p>I apply a simple strategy to get rid of the most predictable paths by using Least Frequency of Occurrence (LFO), counted across hosts (using <em>dc(host)<\/em>). If one path exists on, say&#8230; at least 3 hosts, then it is less likely to be malicious than one that is found on one system only.<\/p>\n\n\n\n<p>When you start using this technique you will come across many obstacles, because lots of paths use random bits. Many of them are somewhat predictable; it&#8217;s just that the patterns keep piling up. And this is where normalization helps a lot.<\/p>\n\n\n\n<p>Normalizing data is pretty easy &#8211; we use the regex-based <em>replace<\/em> function to remove unnecessary junk. You have to start with the most precise patterns and then move towards the vague, hope-for-the-best ones.<\/p>\n\n\n\n<p>Precise ones include:<\/p>\n\n\n\n<ul><li>GUID<\/li><li>SID<\/li><li>Date and time formats (e.g. 
YYYYMMDD)<\/li><li>System32|SysWOW64|sysnative paths<\/li><li>Program Files variants<\/li><li>User folders<\/li><li>Versioning information<\/li><li>Hashes or hash-lookalikes<\/li><li>etc.<\/li><\/ul>\n\n\n\n<p>One has to hope that these will reduce a lot of noise, and if that is still not the case we can go a bit further and start using more vague patterns, e.g.:<\/p>\n\n\n\n<ul><li>full stop followed by digits, especially at the end of directory names<\/li><li>decimal numbers at the end of the path\/folder<\/li><li>multiple digits in a row<\/li><li>hexadecimal numbers<\/li><li>underscore followed by alphanumeric\/digit characters<\/li><li>etc.<\/li><\/ul>\n\n\n\n<p>It&#8217;s pretty hard to talk about \/ learn this without actually doing it, hence I attached a test data set for you to play with (see the bottom of this post).<\/p>\n\n\n\n<p>The test set contains a number of fictional hosts (named after islands) and a bunch of paths that I made up so that we can demonstrate how LFO and normalization can work in tandem. After you download the test set, you can import it into Splunk using the following name: <em>test_dataset_paths.csv<\/em> (I use it in my examples).<\/p>\n\n\n\n<p>To confirm the data is accessible via a lookup file, you can run this command in Splunk:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">| inputlookup test_dataset_paths.csv<\/pre>\n\n\n\n<p>The set includes Paths from 6 hosts:<\/p>\n\n\n\n<ul><li>ARUBA, TAHITI, FIJI, ANTIGUA, BALI, BARBUDA<\/li><\/ul>\n\n\n\n<p>belonging to 6 users:<\/p>\n\n\n\n<ul><li>John, James, Paul, Kate, Joan, Alice<\/li><\/ul>\n\n\n\n<p>John, James, and Paul installed Firefox, and Kate, Joan, and Alice &#8211; Chrome. 
John is the only user who has his system infected.<\/p>\n\n\n\n<p>When we run the inputlookup command, we can immediately see that even with a small number of rows it&#8217;s not that easy to comb through it:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><a href=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl1.png\"><img decoding=\"async\" src=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl1.png\" alt=\"\" class=\"wp-image-7207\" width=\"500\" srcset=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl1.png 613w, https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl1-201x300.png 201w\" sizes=\"(max-width: 613px) 100vw, 613px\" \/><\/a><\/figure>\n\n\n\n<p>Even if we want to do stats over it (per Path):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">| inputlookup test_dataset_paths.csv<br>| stats values(Host) as Hosts by Path<\/pre>\n\n\n\n<p>we get this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl2.png\"><img decoding=\"async\" loading=\"lazy\" width=\"937\" height=\"945\" src=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl2.png\" alt=\"\" class=\"wp-image-7208\" srcset=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl2.png 937w, https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl2-297x300.png 297w, https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl2-150x150.png 150w, https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl2-768x775.png 768w\" sizes=\"(max-width: 937px) 100vw, 937px\" \/><\/a><\/figure>\n\n\n\n<p>It&#8217;s only after we apply normalization that we get a data set that makes it easier for us to decide what to discard:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">| inputlookup test_dataset_paths.csv\n| eval norm_path = Path\n| eval norm_path = replace(norm_path, 
\"(?i)c:\\\\users\\\\[^\\\\]+\", \"\")\n| stats values(Path) as Paths values(Host) as Hosts count by norm_path<\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl3.png\"><img decoding=\"async\" loading=\"lazy\" width=\"983\" height=\"945\" src=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl3.png\" alt=\"\" class=\"wp-image-7209\" srcset=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl3.png 983w, https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl3-300x288.png 300w, https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl3-768x738.png 768w\" sizes=\"(max-width: 983px) 100vw, 983px\" \/><\/a><\/figure>\n\n\n\n<p>Since we see repetitions of identical normalized paths across multiple systems, and some paths are clearly present on all the systems, we can remove them from the view using <em>dc<\/em> where the number of hosts is at least 3:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">| inputlookup test_dataset_paths.csv<br>| eval norm_path = Path<br>| eval norm_path = replace(norm_path, \"(?i)c:\\\\users\\\\[^\\\\]+\", \"\")<br>| stats values(Path) as Paths dc(Host) as dch values(Host) as Hosts count by norm_path<br>| where dch &lt; 3<\/pre>\n\n\n\n<p>This brings us to the final result, which literally shows the malware:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl4.png\"><img decoding=\"async\" loading=\"lazy\" width=\"983\" height=\"945\" src=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl4.png\" alt=\"\" class=\"wp-image-7210\" srcset=\"https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl4.png 983w, https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl4-300x288.png 300w, https:\/\/www.hexacorn.com\/blog\/wp-content\/uploads\/2020\/05\/spl4-768x738.png 768w\" sizes=\"(max-width: 
983px) 100vw, 983px\" \/><\/a><\/figure>\n\n\n\n<p>If you are wondering why I am using <em>dc<\/em> and not <em>count<\/em> for exclusions, it&#8217;s because your data set can include repetitions from the same host (in such a case <em>count<\/em> would be higher, while still applying to a single host); <em>dc<\/em> gives you the number of hosts on which a specific Path occurred at least once.<\/p>\n\n\n\n<p>I know it&#8217;s trivial and probably used by every splunker out there, but I remember that when I started learning SPL I was working with huge datasets from the start and it was quite overwhelming. Working with a small custom-made set makes it easier to test ideas and&#8230; regexes (especially those that require guessing how many backslashes to put in to escape characters properly).<\/p>\n\n\n\n<p>And speaking of the devil&#8230; here is a bunch of <em>replace<\/em> function regex examples you can consider using:<\/p>\n\n\n\n<ul><li>Users on PC<ul><li>(?i)c:\\\\users\\\\[^\\\\]+<\/li><\/ul><\/li><li>Users on MAC<ul><li>(?i)^\/Users\/[^\/]+<\/li><\/ul><\/li><li>GUID<ul><li>(?i)[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}<\/li><\/ul><\/li><li>Timestamp (one of many variants; it&#8217;s NOT precise, but good enough for day-to-day work)<ul><li>(?i)20[12][0-9][01][0-9][0-3][0-9][0-9]+\\.[0-9]+<\/li><\/ul><\/li><li>Timestamp using actual names of months<ul><li>(?i)\\d+(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?|aug(ust)?|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?)\\d+<\/li><\/ul><\/li><li>SID<ul><li>(?i)S-.-.-..-[0-9]+-[0-9]+-[0-9]+-[0-9]+<\/li><\/ul><\/li><li>Options (command line)<ul><li>(?i) -+[a-z0-9-_]+=[^ ]+<\/li><\/ul><\/li><li>Random directory with a tilde followed by digits<ul><li>(?i)~\\d+\\\\\\\\<\/li><\/ul><\/li><\/ul>\n\n\n\n<p>They may be buggy (both the logic and the formatting of this blog may affect them), so treat them with caution and always check against your data set. 
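<\/p>\n\n\n\n<p>Outside of Splunk, the same idea can be sketched in a few lines of Python. The snippet below is an illustration only, not the SPL implementation: the rule set, the placeholder tokens (USERDIR, GUID, SID), and the helper names are all invented for this demo. It normalizes each path with a couple of the precise patterns from the list above, then keeps only the normalized paths seen on fewer than three distinct hosts &#8211; the LFO filter.<\/p>\n\n\n\n

```python
import re
from collections import defaultdict

# Illustrative normalization rules modelled on the patterns above; the
# placeholder tokens (USERDIR, GUID, SID) are invented for this sketch.
RULES = [
    (re.compile(r"(?i)c:\\users\\[^\\]+"), "USERDIR"),
    (re.compile(r"(?i)[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}"
                r"-[0-9a-f]{4}-[0-9a-f]{12}"), "GUID"),
    (re.compile(r"(?i)s-1-[0-9-]+"), "SID"),
]

def normalize(path):
    """Apply each rule in turn, most precise patterns first."""
    for pattern, token in RULES:
        path = pattern.sub(token, path)
    return path

def rare_paths(rows, min_hosts=3):
    """rows: (host, path) pairs. Keep normalized paths seen on fewer
    than min_hosts distinct hosts -- the LFO filter, equivalent to
    'stats dc(Host) as dch ... | where dch < 3' in the SPL above."""
    hosts_by_norm = defaultdict(set)
    for host, path in rows:
        hosts_by_norm[normalize(path)].add(host)
    return {p: h for p, h in hosts_by_norm.items() if len(h) < min_hosts}
```

\n\n\n\n<p>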
It is mainly meant to show a couple of ideas that can help you get started.<\/p>\n\n\n\n<p>Using LFO and normalization, along with a few other tricks (e.g. filtering by additional fields, counting per host or directory, bucketing, adding weights for risk-based scoring), leads us to a very favorable outcome: we stop using inclusion\/exclusion lists, i.e. we stop using signatures.<\/p>\n\n\n\n<p>The data set is <a href=\"https:\/\/hexacorn.com\/examples\/test_dataset_paths.csv\">here<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the most annoying bits that we come across while doing log analysis is both predictability and unpredictability of file paths. Somehow&#8230;. everyone really&#8230; vendors, admins, and finally users keep coming up with new ways to name files that &hellip; <a href=\"https:\/\/www.hexacorn.com\/blog\/2020\/05\/23\/normalizing-our-path-to-splunk-enlightenment\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[86],"tags":[],"_links":{"self":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/7206"}],"collection":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/comments?post=7206"}],"version-history":[{"count":4,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/7206\/revisions"}],"predecessor-version":[{"id":7227,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/posts\/7206\/revisions\/7227"}],"wp:attachment":[{"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/media?parent=7206"}],"wp:term":[{"taxonomy":"category","emb
eddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/categories?post=7206"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.hexacorn.com\/blog\/wp-json\/wp\/v2\/tags?post=7206"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}