Update
As Codeslack pointed out, regexes themselves are state machines. Of course. The biggest difference between using built-in regexes and writing your own state machine is that with your own you have full control over the states and can memorize anything you want, including many previous states.
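A trivial illustration of that 'memory' (my example, in Perl since that is what I use later in this post): flagging any line that repeats the line directly before it is awkward to express as a single per-line regex, but it is a two-variable state machine:

use strict;
use warnings;
my $prev;                          # memorized state: the previous line seen
while (my $line = <>)
{
    chomp $line;
    print "duplicate: $line\n" if defined $prev && $line eq $prev;
    $prev = $line;                 # update the memory before the next read
}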
Old post
There are tons of tutorials and websites dedicated to Regular Expressions. I love Regular Expressions and have been using them all the time, for many years… but… they have so many limitations. This is why you need to not only learn them, but also use them wisely… and you often need to enrich their usage by making them part of a state machine.
What is a state machine?
I will say first what it is useful for: processing data that is very hard to handle using regular expressions alone or other simple data-processing algorithms and tools, data that even more complex tools still fail to process in a generic way.
Parsing a Comma Separated Value (CSV) file is a very simple… yet extremely complex example of data that requires a state machine for proper analysis. Many beginner programmers write CSV parsers and… fail. Interestingly… often without even noticing. Why? What they write usually works for their use case, and they forget about it when they move on to a different project.
A typical naive approach assumes that the CSV data is properly formatted, uses English characters only (ANSI!), contains no multiline values, no special characters (e.g. quotes), etc. A parser based on such assumptions reads data line by line, splits each line on the comma character, and… hooray… all parsing done in 2-3 lines of code. Unfortunately, this is an awful simplification that leads to corrupted imports and bad packages deployed all over the place.
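To make the failure mode concrete, here is a sketch of that naive approach (my illustration, not code from any real package). Feed it a line like "Smith, John",London and it happily reports three fields instead of two:

use strict;
use warnings;
# The naive parser: read line by line, blindly split on commas.
while (my $line = <>)
{
    $line =~ s/[\r\n]+//g;
    my @fields = split /,/, $line;  # quoted commas break this split
    print scalar(@fields), " fields: @fields\n";
}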
I remember, 10 years ago or so, trying to read Autorunsc dumps en masse. Every forensic consultant at our company was coming back from their gigs bringing tons of various data dumps, including autorunsc CSV files from each system they live-forensicated.
Spotting an obvious data mining opportunity (after collecting so many dumps from so many gigs), I was trying to build a magic unicorn tool that would detect bad stuff on the spot, where applicable. My idea followed this train of thought: a gig consultant collects autorunsc and other data dumps –> runs my script while still on site –> if we get lucky, we detect badness on site –> customer superhappy. Today this approach is referred to as LFO (Least Frequency of Occurrence); I also used whitelisting and blacklisting, but these have been known forever. These were the early days of IR and Light IR, and other than GNU tools and some basic scripts there was no DB backend to process this data the way we do today….
Anyway….
Excited about the prospect, I went out there to see how to read CSV files in a quick way. After looking at and testing many CSV-parsing Perl packages available at that time, I couldn't get any one of them to parse every single data file I had in a reliable way. Many files were parsed properly, but a lot would leave stuff behind that was not digestible by any parser. Either Autorunsc was saving data using an incorrect CSV format, or the Perl CSV packages were bad. As far as I remember, the blame was shared between the two, but knowing this didn't help my case at all….
The enlightenment came from reading the actual CSV specification (RFC 4180). When you skim through it you quickly realize two things:
- No one ever reads stuff like this anymore 😉
- It’s unlikely that anyone covers all angles while saving files in this format
The result is that we have many badly implemented CSV parsers out there. You also realize why: this format is NOT as simple as many people think. Quite the opposite: even today, after so many years, even Excel (which actually allows a lot of margin for error!) still fails to process some of these files correctly…
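For illustration, here is a single, perfectly valid CSV record per RFC 4180. It spans two physical lines and contains an embedded comma and an escaped (doubled) quote, which is exactly the kind of input that kills line-and-split parsers:

"Smith, John","He said ""hello""
and then left",42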
In the end, I wrote my own CSV parsing state machine that worked for my repo of autorunsc files, but would surely fail for other CSV files produced by other programs. Yes, I wrote yet another imperfect and really bad CSV parser.
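Still, for the curious, here is a minimal sketch of what such a character-by-character CSV state machine can look like. This is a reconstruction for illustration (the state numbering and helper subs are made up for this post), not my original parser, and it is lenient about malformed input:

use strict;
use warnings;
# States of the machine:
#   0 = outside quotes (plain field content)
#   1 = inside a quoted field
#   2 = just saw a quote while inside a quoted field
my $q = 0;
my @field;   # characters of the field being built
my @row;     # completed fields of the current record
sub end_field { push @row, join('', @field); @field = (); }
sub end_row   { end_field(); print join(' | ', @row), "\n"; @row = (); }
while (defined(my $c = getc(STDIN)))
{
    if ($q == 0)
    {
        if    ($c eq '"')  { $q = 1; }
        elsif ($c eq ',')  { end_field(); }
        elsif ($c eq "\n") { end_row(); }
        elsif ($c ne "\r") { push @field, $c; }
    }
    elsif ($q == 1)
    {
        if ($c eq '"') { $q = 2; }           # end of field... or an escaped quote?
        else           { push @field, $c; }  # commas and newlines are kept verbatim
    }
    else # $q == 2
    {
        if    ($c eq '"')  { push @field, '"'; $q = 1; }  # "" is a literal quote
        elsif ($c eq ',')  { end_field(); $q = 0; }
        elsif ($c eq "\n") { end_row();   $q = 0; }
        elsif ($c ne "\r") { $q = 0; }   # lenient: drop junk after a closing quote
    }
}
end_row() if @field || @row;   # flush a final record with no trailing newline

The whole trick is the extra state 2: it exists purely to remember that we just saw a quote inside a quoted field, so the next character can tell us whether that quote was an escape ("") or the end of the field.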
I bring up this CSV anecdote in a post about state machines for a reason.
When you do data processing, you need to know many basic tricks of the trade: Excel and GNU tools first, then more complex things like databases, or aggregation systems that allow processing of large volumes of logs, but… finally… you really want to know how to… actually program.
Programming a state machine is fun. You start with a state equal to 'I have no clue'. You typically assign some value to a 'state' variable that will hold the current state of the machine, and then start parsing data. You can parse character by character, word by word, or sentence by sentence, anything really. More advanced tokenizers that follow grammar rules can be used too (and actually are, for programming languages, JSON, XML, etc.). Any time you read something in, you determine whether the state of the machine needs to change. And when it reaches a state where you are ready to output something, you simply… do it…
The code below is a simple example of a state machine. We start with the state set to 0 ($q=0). We read input line by line. As soon as we encounter four hexadecimal byte values in a row at the start of a line, we change the state to $q=1 and preserve these four hex values in the $first_line variable.
When we read the next line, we are in state 1, and this time we only check whether the currently read line is an empty string. If it is, it marks the end of a hexadecimal block, so we print out the four 'memorized' hexadecimal values preserved in the $first_line variable. We then return the state to 0. This literally means we start processing input data as if we had started from the top of the file: any time we encounter hexadecimal values (at least four of them), we start looking for the end of that hexadecimal block.
use strict;
use warnings;
my $q = 0;           # state: 0 = looking for a dump header, 1 = inside a dump block
my $first_line = ''; # the four memorized hex values
while (<>)
{
    s/[\r\n]+//g;    # strip line endings
    if ($q == 0)
    {
        # four hex byte values ("xx xx xx xx ") at the start of the line?
        if (/^(([0-9a-f]{2} ){4})/i)
        {
            $q = 1;
            $first_line = $1;
        }
    }
    elsif ($q == 1)
    {
        # an empty line ends the current dump block
        if (/^\s*$/)
        {
            print "$first_line\n";
            $first_line = '';
            $q = 0;
        }
    }
}
# flush: a block at the very end of the file may not be followed by an empty line
print "$first_line\n" if $q == 1;
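To see it in action, here is a hypothetical input with two hex dump blocks (byte values invented for this example):

4d 5a 90 00 03 00 00 00
50 45 00 00 4c 01 02 00

de ad be ef 00 11 22 33
ca fe ba be 44 55 66 77

Run through the script, it prints the first four byte values of each block:

4d 5a 90 00
de ad be ef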
This may look complex, but when you start 'emulating' in your head what happens to a file being read, you will soon realize that what this state machine does is simple: it looks for the first line of a hexadecimal dump in a file that stores many hexadecimal dumps, and then prints out that first line for each section. It's pretty hard to do with regular tools, but a simple script that follows the principle of a state machine can solve it in no time.
Parsing using regular expressions works on the boundary of a chunk of data that is predictable and systematic BUT DOESN'T NEED TO KNOW what happened before (I ignore backreferences here); a manual, self-implemented state machine, on the other hand, allows you to record what came before and act upon it.
Anything that breaks a typical regex parsing pattern usually needs to be parsed with something more complex. Something with a MEMORY. A state machine is just one of the algorithms you can use, but there are many others. In some cases machine learning could help; in others you may need parsing that walks through many phases, recursion, a few iterative rounds, etc.
I am not a data scientist, but any time I write a successful state machine, I certainly feel like one. The ability to use some quick & dirty, ad hoc programming to deal with unstructured or less predictable data is a great asset in your toolkit…