State Machine vs. Regex

September 6, 2019 in Forensic Analysis, How to..., Tips & Tricks

Update

As Codeslack pointed out, regexes themselves are state machines. Of course. The biggest difference between using built-in regexes and rolling your own state machine is that you have full control over the states and can memorize anything you want, including many previous states.

Old post

There are tons of tutorials and websites dedicated to Regular Expressions. I love Regular Expressions and have been using them all the time, for many years… but… they have so many limitations. This is why you need to not only learn them, but also use them wisely… and you often need to enrich them by making them part of a state machine.

What is a state machine?

Let me first say what it is useful for: you want to use one to process data that is very hard to process using regular expressions alone, or other simple data processing algorithms and tools; data that even more complex tools often fail to process in a generic way.

Parsing a Comma-Separated Values (CSV) file is a seemingly simple… yet surprisingly complex example of data that requires a state machine for proper analysis. Many beginner programmers write CSV parsers and… fail. Interestingly… often without even noticing. Why? What they write usually works for their use case, and they forget about it when they move to a different project.

A typical naive approach assumes that the CSV data is properly formatted, uses English characters only (ANSI!), contains no multiline values, no special characters (e.g. quotes), etc. A parser based on such assumptions reads the data line by line, splits each line on the comma character, and… hooray… all parsing done in 2-3 lines of code. Unfortunately, this is an awful oversimplification that leads to corrupted imports and bad packages deployed all over the place.
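To see why, here is a quick Python sketch (the data is made up for illustration): the naive split falls apart the moment a quoted field contains a comma, while a real CSV parser, which internally runs a little state machine of its own, recovers the intended fields:

```python
import csv
import io

# A perfectly legal CSV line: the second field contains a comma
# inside quotes, and the third field contains an escaped quote.
line = 'foo,"bar, baz","he said ""hi"""\n'

# The naive approach: split on every comma.
naive = line.rstrip('\r\n').split(',')
print(naive)   # 4 pieces instead of 3 -- the quoted comma broke the split

# The csv module tracks quoting state internally and
# recovers the intended 3 fields.
proper = next(csv.reader(io.StringIO(line)))
print(proper)  # ['foo', 'bar, baz', 'he said "hi"']
```

The naive version does not even raise an error; it silently produces garbage, which is exactly how these bugs survive unnoticed.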

I remember, 10 years ago or so, trying to read Autorunsc dumps en masse. Every forensic consultant at our company was coming back from their gigs with tons of various data dumps, including autorunsc CSV files from each system they live-forensicated.

Spotting an obvious data mining opportunity (after collecting so many dumps from so many gigs), I was trying to build a magic unicorn tool that would detect bad stuff on the spot, where applicable. My idea followed this train of thought: a gig consultant collects autorunsc and other data dumps –> runs my script while still on site –> if we get lucky, detects badness on site –> customer super happy. Today this approach is referred to as LFO (Least Frequency of Occurrence); I also used whitelisting and blacklisting, but these have been known forever. These were the early days of IR and Light IR, and other than GNU tools and some basic scripts there was no DB backend to process this data like we do today….

Anyway….

Excited about the prospect, I went out there to see how to read CSV files quickly. After looking at and testing the many CSV parsing Perl packages available at that time, I couldn’t get any of them to parse every single data file I had reliably. Many files were parsed properly, but a lot would leave stuff behind that was not digestible by any parser. Either Autorunsc was saving data in an incorrect CSV format, or the Perl CSV packages were bad. As far as I remember, the blame was shared between the two, but knowing this didn’t help my case at all….

The enlightenment came from reading the actual CSV specification (RFC 4180). When you skim through it you quickly realize two things:

  1. No one ever reads stuff like this anymore 😉
  2. It’s unlikely that anyone covers all angles while saving files in this format

The result is that we have many badly implemented CSV parsers out there. You also realize why: this format is NOT as simple as many people think. Quite the opposite; even today, after so many years, even Excel (which actually allows a lot of margin for error!) still fails to process some of these files correctly…

In the end, I wrote my own CSV-parsing state machine that worked for my repo of autorunsc files, but would surely fail on CSV files produced by other programs. Yes, I wrote yet another imperfect, really bad CSV parser.

I bring up this CSV anecdote for a reason: it is a good excuse to talk about state machines.

When you do data processing, you need to know many basic tricks of the trade: Excel, GNU tools, then more complex things like databases or aggregation systems that allow processing of large volumes of logs, but… finally… you really want to know how to… actually program.

Programming a state machine is fun. You start with a state equal to ‘I have no clue’. You typically assign some value to a ‘state’ variable that will hold the current state of the machine, and then start parsing data. You can parse character by character, word by word, or sentence by sentence; anything, really. More advanced tokenizers that follow grammar rules can be used too (and actually are, for programming languages, JSON, XML, etc.). Any time you read something in, you determine whether the state of the machine needs to change. And when it reaches the state where you are ready to output something, you simply… do it…

The code below is a simple example of a state machine. We start with the state set to 0 ($q=0). We read input line by line. As soon as we encounter 4 hexadecimal values in a row (in the input line), we change the state to $q=1 and preserve these 4 hex values in a $first_line variable.

The next time we read a line, we are in state 1, and this time we only check whether the currently read line is an empty string. If it is, it means we reached the end of a hexadecimal block, so we print out the 4 ‘memorized’ hexadecimal values that we preserved inside the $first_line variable. We then return the state to 0. This literally means we start processing input data as if we had started from the top of the file. Any time we encounter hexadecimal values (at least 4 of them), we start looking for the end of that hexadecimal block.

 use strict;
 use warnings;

 my $q = 0;            # state 0: hunting for the start of a hex block
 my $first_line = '';  # memorized first line of the current block

 while (<>)
 {
   s/[\r\n]+//g;       # strip line endings
   if ($q == 0)
   {
     # four hex byte pairs at the start of the line: a hex block begins
     if (/^(([0-9a-f]{2} ){4})/i)
     {
       $q = 1;
       $first_line = $1;
     }
   }
   elsif ($q == 1)
   {
     # an empty line ends the block: print what we memorized
     if (/^\s*$/)
     {
       print "$first_line\n";
       $first_line = '';
       $q = 0;
     }
   }
 }

This may look complex, but when you start ’emulating’ in your head what happens to a file being read, you will soon realize that what this state machine does is simple: it looks for the first line of each hexadecimal dump in a file that stores many hexadecimal dumps, and prints out that first line for each section. It’s pretty hard to do with regular tools, but a simple script that follows the principle of a state machine solves it in no time.
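For readers who don’t speak Perl, the same machine can be sketched in Python (a rough equivalent, not a drop-in replacement; the sample data is made up):

```python
import re

HEX_LINE = re.compile(r'^(([0-9a-fA-F]{2} ){4})')  # 4 hex byte pairs at line start
BLANK = re.compile(r'^\s*$')

def first_lines_of_hex_blocks(lines):
    """Yield the memorized first line of every hex-dump block."""
    q = 0             # state 0: hunting for the start of a hex block
    first_line = ''
    for raw in lines:
        line = raw.rstrip('\r\n')
        if q == 0:
            m = HEX_LINE.match(line)
            if m:
                q = 1              # state 1: inside a block, waiting for a blank line
                first_line = m.group(1)
        elif q == 1:
            if BLANK.match(line):
                yield first_line   # end of block: emit what we memorized
                first_line = ''
                q = 0

sample = [
    'some header\n',
    'de ad be ef 00 11\n',   # first line of a hex block
    '22 33 44 55\n',
    '\n',                    # blank line ends the block
    'aa bb cc dd ee\n',
    ' \n',
]
print(list(first_lines_of_hex_blocks(sample)))
```

Like the Perl original, it only emits a block’s first line once a blank line closes the block, so a dump that ends without a trailing blank line is silently dropped.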

Parsing with regular expressions works on the boundary of a chunk of data that is predictable and systematic, BUT DOESN’T NEED TO KNOW what happened before (I ignore backreferences here). A manual, self-implemented state machine, on the other hand, allows you to record what came before and act upon it.

Anything that breaks a typical regex parsing pattern usually needs to be parsed with something more complex. Something with a MEMORY. A state machine is just one of the algorithms you can use, but there are many others. In some cases machine learning could help; in others you may need parsing that walks through many phases, recursion, a few iterative rounds, etc.
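To make the MEMORY point concrete, here is a classic task that a single regex cannot solve: checking whether parentheses are balanced. It requires counting, and a counter is just a state machine whose state is an integer (a minimal illustrative sketch):

```python
def parens_balanced(text):
    """Return True if '(' and ')' are balanced in text.

    The 'state' here is a simple depth counter -- exactly the kind of
    memory a plain regex cannot carry from one character to the next.
    """
    depth = 0
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:        # a close before its open: broken
                return False
    return depth == 0            # every open must have been closed

print(parens_balanced('(a (b) c)'))   # True
print(parens_balanced('(a (b c)'))    # False
```

The same counting trick generalizes to quote tracking in CSV, nested blocks in config files, and so on.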

I am not a data scientist, but any time I write a successful state machine, I certainly feel like one. The ability to use some quick & dirty, ad hoc programming to deal with unstructured or less predictable data is a great asset in your toolkit…

The art of writing (for IT Sec)

May 19, 2019 in How to..., Off-topic, Preaching, Random ideas

When I wrote my first DFIR report it was terrible. After receiving the commented version back from my reviewers, my heart sank. I felt I was not going to make it. While I love the technical and investigative bits, and had a good win on that particular investigation… somehow, I was unable to communicate it. And since I had always liked to write, I was really surprised (a.k.a. shocked a.k.a. ego-hurt-badly).

All those hours of work put into the report didn’t matter; all the cool technical bits I described didn’t matter. When the doc came back to me it was pretty much a different document… Yup, so many comments and corrections. I literally couldn’t see my original content. There was so much ‘Adam, you are doing it wrong’… Ouch.

I must add that it was for a Law Enforcement case, so it was a big deal.

I went back and forth on these comments with my reviewers and finally…

  • Got that report into decent shape & submitted it to the LE
  • Realized that writing for the general public, or blogging, is not the same as writing for DFIR, especially for LE

And it became especially clear when I received a letter telling me to show up in court and testify… Imagine my horror. I was a noob, and yes, that absolutely terrible report was going to be talked about. And I would be questioned on its content…

Holy cow.

It’s actually pretty intimidating. Confidence from the safety of your home or office seat is one thing, but talking about your work in Court is something completely different. And the guys who ask you questions will try to break you and paint you as an incompetent clown. Your report and work may lose credibility… After a mandatory panic attack I started asking around. Some of my peers had been through this before and gave me many hints: only answer questions, don’t add any extra info, don’t speculate, don’t be afraid to share a professional opinion but keep it concise, don’t get emotional, watch out for attempts to dismiss your evidence or target your credibility (personal attacks, etc.), etc. So… YES. That was pretty intimidating, to say the least.

I kinda got lucky on that one and eventually didn’t have to testify, because the guy pleaded guilty (my report actually helped to persuade him!!!), but from then on I learned to be more careful, more humble, and definitely more organized with regard to what I write, especially commercially.

It’s really easy to make claims; it’s much harder to support and describe evidence, build a proper case, argument, or timeline, or, when there is no evidence, at least offer an educated guess and share a professional opinion to support it (including contextualizing circumstantial evidence).

Think about it for a second: from a DFIR perspective we use a lot of tools to extract and interpret evidence. While we are happy building timelines, the whole process of data extraction and interpretation could be called into question. How do we know, or how are we so sure, that the programs we use extract and interpret data correctly?

Notably, what you know, or what you think you know, will be scrutinized in every possible way, so as you write your report you need to re-read a lot of older documents and reference materials to avoid making a statement that is easy to prove incorrect, inaccurate, or too general. That may ruin your case. To give you an example… Say… you describe that programs always load in a certain way under Windows, and that this is the only way to run programs. Be careful not to make an overstatement or misrepresentation: as it turns out, there are a lot of other ways to run code on Windows, whether via shellcode, exploits, side-loading, etc. The moment you are caught making statements that can be proven inaccurate, your credibility may suffer.

This is where this article begins.

Whether you write a DFIR report, a pentesting report, a malware write-up, a Threat Intel doc, or just fill in a ticket, or even post on a blog or Twitter, think for a second about the following:

  • Who is your audience?
  • Who is your audience that you don’t know of?
    • Tickets are often reviewed by Compliance/Audit teams
    • Your most Senior Management may read it one day, even if whimsically
    • In case of a breach, tickets related to the breach-related events/incidents may become evidence in Court
  • How accurate is your description?
    • Did you write about facts or share an opinion?
    • Did you use language that may not be fit for the purpose? Slang, vulgarisms, personal opinions, puns, jokes, commentary, etc. have no place in these cases
    • Can a non-technical person understand what you wrote? Will they understand how it will affect them?
    • If it is a ticket, is there closure? You shouldn’t close tickets without closure statements, even if it’s just a simple ‘Based on the investigation, there is no further risk, and the ticket can be closed’; it helps you, helps your manager, and helps the org if these statements are there
  • Will the audience focus on the headline only, the summary, or the gory details?
  • Are you the first one to publish about it? Do your homework, and always give credit to any relevant older research, if you can find it. Update your post if you find the references later, or if someone provides you with a link (you would be surprised how many times people send me links to web.archive.org where some long-forgotten blog/PDF from the early noughties discusses a topic similar to one I just wrote about thinking it was a novelty)
  • Assume that at least one person will come back to you with comments that will bring a revolution to your thought process (e.g. point out gaps in your thinking, suggest/reference older /often better/ research on the same topic, or a better, more efficient approach to the same problem); anticipate it and accept it in a humble way; remember to thank these guys: they not only read your stuff, they enrich your knowledge!!
  • Assume you may need to explain your claims in ELI5 fashion one day; and finally…
  • If possible, describe what you did so it can be replicated and/or re-analyzed; share code, data, examples, and queries; attach files and results; add comments on how you interpreted them.

This sounds trivial and kinda overdone, right? Let’s see…

  • Twitter is mainly opinions – who cares
  • Tickets’ content is almost never read by anyone – who cares
  • Blogs are blogs – who cares
  • Malware reports are now so generic that they are primarily part of a PR machine, and are actually really easy to write (most of the time: a quick intro, some IDA/Olly/Xdbg/Ghidra/DNSpy screenshots walking through the malware stages, finally a conclusion with a marketing bit, and then yara+IOCs); they can also be semi-automatically generated from sandboxes – who cares
  • Red Team / pentest reports are also semi-automated in many ways, and often just focus on an extensive list of vulnerabilities found by scanners, or the ‘I pwned you, patch your systems, kthxbye’ bit if they managed to actually compromise some systems; notably, red teams, similarly to DFIR teams, need a lot of willpower and incentive to keep logs of all the steps they take; why? because it’s often poking around w/o any success for many hours; when they hit the jackpot, they immediately chase the leads (DFIR) or explore new paths (red team); this is _hard_ to document, because excitement takes over – still, who cares
  • DFIR reports, even if still manually written, increasingly suffer/benefit from automation too; copypasta and generalizations are the norm, and a predictable TOC (often enforced by standards, e.g. in PFI breaches) is there too
  • Finally, Threat Intel is kind of a beast of its own; from literal forwards of PDFs, through copypasta exercises, to actual valuable intel pieces affecting your org (it was very bad a few years ago, but it’s getting better and better).

Notably, other industries suffer from templates and copypasta as well, so it’s not an infosec-centric phenomenon. So many T&S, commercial reports, surveys, searches, etc. are not only inconclusive, but almost all of them are written in a ‘we don’t take any responsibility’ way. With regard to searches and reports, they are also typically direct exports from databases, and while in some cases they may get enriched by a quick yet superficial ‘personal touch’ to make them more credible, they are just an easy source of revenue for the companies that own these databases. Sadly, infosec is following in these footsteps. And while we are all pressured by time, and billable hours are what matters… it will be quite a shame if we end up delivering the same vague content as a part of BAU (Business As Usual).

This is where this article begins being practical.

Lenny Zeltser published Writing Tips for IT Professionals. If you have not read it, please do. It is a great tutorial on how to be strategic about your writing.

Also, for anything you write, assume that LE, C-level guys, firms engaged commercially to re-do/confirm/audit your DFIR/pentest analysis, and experts in the industry will read it at some stage. Also… assume these reports will become public… cuz… breaches.

So… try to write in a defensive way; make your lack of knowledge known (where applicable). Suggest avenues for additional research if you can. Don’t claim anything 100%, but at the same time use common sense so that your article doesn’t overuse words like ‘allegedly’, ‘possibly’, ‘probably’, ‘reportedly’, ‘supposedly’, etc. Be honest, be humble. Focus on facts, not editorializing.

Also… use the Alexiou Principle; it’s such a simple yet powerful recipe for writing almost any report/write-up within the infosec space in a defensive way: what question are you trying to answer, what data do you need to answer it, how do you extract that data, and what does that data tell you. If you include these 4 points it’s almost guaranteed that all the questions asked by a client, LE, or sponsor will be addressed. The fewer follow-ups on the report, the better writer you are.

Finally, you need to practice. The more you write, the better you will get at it. Also, read documents aimed at the same audience spectrum: if you need to write DFIR reports, read the publicly available reports about breaches. Cherry-pick language and statements, as well as formatting style and document organization.

And last, but not least: do peer review, if possible. Ask more senior folks to look at what you write. Ask them if anything sounds too vague. Correct it.

And to be honest, this post is a good example of bad writing: I mixed up a lot of things and didn’t have much structure here. If you read this far, thank you.