Windows Symbols A.D. 2020

I may be in the minority, but I do use Windows Firewall on most of my boxes. I deny all connections by default, including some of the predefined ones, and only selectively enable a few, just enough to get by with the required functionality. And anytime I need to deal with a more internet access-hungry app, I just run it from a VM.

Some apps are hard to run from a VM though. Probably the most annoying bit when your Windows Firewall is set to deny everything by default is Office 365. Its main functionality is not word or spreadsheet editing, but confirming that your Office version is legitimate. To do so (and it does it all the time) it obviously needs to connect out. However, the rules one would need to set up for this to work properly are absolutely, and kinda obviously, crazy. This page gives you the details. Thanks, Microsoft!

With Windows Firewall on, you will come across one more problem:

– access to the symbol server.

It’s often great to have access to it, and yet I don’t feel like enabling carte-blanche access to port 80 or 443 for every reversing tool that I happen to run. So I go with IP-specific allow rules.

And here’s the catch:

– in the past, one would check the IP that msdl.microsoft.com resolves to, and enable connectivity to that IP only.

Times have changed though, and we live in a world of CDNs and redirectors. As such, enabling access to the IP that msdl.microsoft.com maps to is no longer enough, because that server redirects all requests to a bunch of other servers.

How do we find them?

I don’t have a generic answer, but we can cheat a bit.

You can try to use curl or wget to download the following PDB from the server (the flags I use print out a lot of debug/verbose logs, which comes in handy):

wget -v -d https://msdl.microsoft.com/download/symbols/regedit.pdb/85B6C521417160A68521696D68568CB41/regedit.pdb

If you look at the logs your downloading tool outputs, you will notice that the request is being redirected to a different symbol server, e.g.:

https://vsblobprodscussu5shard76.blob.core.windows.net/….

So, yes, you need to find out what the IP of this server is, and voila… now your rules should work.
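
If you want to script that lookup, here is a minimal Python sketch (the shard hostname in the comment is just the example from the redirect above; also note that CDN-backed hosts can re-resolve to different addresses over time, so rules built this way go stale and need re-checking):

```python
import socket

def resolve_ips(host):
    """Return the set of IPv4 addresses a hostname currently resolves to,
    suitable for pasting into IP-based firewall allow rules."""
    return {info[4][0] for info in socket.getaddrinfo(host, 443, socket.AF_INET)}

# e.g. resolve_ips("vsblobprodscussu5shard76.blob.core.windows.net")
```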

If you are wondering how I found this out… I checked from a VM with the firewall disabled. Literally, this is a regular activity for anyone who wants to keep their host OS in, err… firewall denial.

Googling around for vsblobprodscussu5shard76, I came across only two posts, and this one is the winner in the contest of value-and-madness-adding content…; the list of possible servers goes as follows:

StorageAccount
vsblobprodscussu5shard90
vsblobprodscussu5shard9
vsblobprodscussu5shard89
vsblobprodscussu5shard88
vsblobprodscussu5shard87
vsblobprodscussu5shard86
vsblobprodscussu5shard85
vsblobprodscussu5shard84
vsblobprodscussu5shard83
vsblobprodscussu5shard82
vsblobprodscussu5shard81
vsblobprodscussu5shard80
vsblobprodscussu5shard8
vsblobprodscussu5shard79
vsblobprodscussu5shard78
vsblobprodscussu5shard77
vsblobprodscussu5shard76
vsblobprodscussu5shard75
vsblobprodscussu5shard74
vsblobprodscussu5shard73
vsblobprodscussu5shard72
vsblobprodscussu5shard71
vsblobprodscussu5shard70
vsblobprodscussu5shard7
vsblobprodscussu5shard69
vsblobprodscussu5shard68
vsblobprodscussu5shard67
vsblobprodscussu5shard66
vsblobprodscussu5shard65
vsblobprodscussu5shard64
vsblobprodscussu5shard63
vsblobprodscussu5shard62
vsblobprodscussu5shard61
vsblobprodscussu5shard60
vsblobprodscussu5shard6
vsblobprodscussu5shard59
vsblobprodscussu5shard58
vsblobprodscussu5shard57
vsblobprodscussu5shard56
vsblobprodscussu5shard55
vsblobprodscussu5shard54
vsblobprodscussu5shard53
vsblobprodscussu5shard52
vsblobprodscussu5shard51
vsblobprodscussu5shard50
vsblobprodscussu5shard5
vsblobprodscussu5shard49
vsblobprodscussu5shard48
vsblobprodscussu5shard47
vsblobprodscussu5shard46
vsblobprodscussu5shard45
vsblobprodscussu5shard44
vsblobprodscussu5shard43
vsblobprodscussu5shard42
vsblobprodscussu5shard41
vsblobprodscussu5shard40
vsblobprodscussu5shard4
vsblobprodscussu5shard39
vsblobprodscussu5shard38
vsblobprodscussu5shard37
vsblobprodscussu5shard36
vsblobprodscussu5shard35
vsblobprodscussu5shard34
vsblobprodscussu5shard33
vsblobprodscussu5shard32
vsblobprodscussu5shard31
vsblobprodscussu5shard30
vsblobprodscussu5shard3
vsblobprodscussu5shard29
vsblobprodscussu5shard28
vsblobprodscussu5shard27
vsblobprodscussu5shard26
vsblobprodscussu5shard25
vsblobprodscussu5shard24
vsblobprodscussu5shard23
vsblobprodscussu5shard22
vsblobprodscussu5shard21
vsblobprodscussu5shard20
vsblobprodscussu5shard2
vsblobprodscussu5shard19
vsblobprodscussu5shard18
vsblobprodscussu5shard17
vsblobprodscussu5shard16
vsblobprodscussu5shard15
vsblobprodscussu5shard14
vsblobprodscussu5shard13
vsblobprodscussu5shard12
vsblobprodscussu5shard11
vsblobprodscussu5shard10
vsblobprodscussu5shard1

These account names could have either of these suffixes:

{storageaccountname}.vsblob.vsassets.io
{storageaccountname}.blob.core.windows.net
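
If you want something firewall-ready, the table above expands mechanically. A small Python sketch (assuming the shard numbering really is the contiguous 1–90 range the list suggests):

```python
# Expand the shard account names (1..90, per the list above) with both
# documented suffixes to get every candidate hostname.
SUFFIXES = ("vsblob.vsassets.io", "blob.core.windows.net")

def candidate_hosts():
    for n in range(1, 91):
        for suffix in SUFFIXES:
            yield f"vsblobprodscussu5shard{n}.{suffix}"

hosts = list(candidate_hosts())
print(len(hosts))  # 180 candidate hostnames
```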

Good luck…

State Machine vs. Regex

Update

As Codeslack pointed out, regexes themselves are state machines. Of course. The biggest difference between using built-in regexes and your own state machine is that with your own you have full control over the states and can memorize anything you want, including many previous states.

Old post

There are tons of tutorials and web sites dedicated to Regular Expressions. I love Regular Expressions and have been using them all the time, for many years… but… they have so many limitations. This is why you need to not only learn them, but also use them wisely… and often you need to enrich their usage by making them part of a state machine.

What is a state machine?

Let me first say what it is useful for: you want to use one to process data that is very hard to handle using regular expressions alone, or with other simple data-processing algorithms and tools; even more complex ones often fail to process such data in a generic way.

Parsing a Comma Separated Value (CSV) file is a very simple… yet extremely complex example of data that requires a state machine for proper analysis. Many beginner programmers write parsers for CSV and… fail. Interestingly… often without even noticing. Why? What they write usually works for their use case, and they forget about it when they move to a different project.

A typical naive approach assumes that the CSV data is properly formatted, uses English characters only (ANSI!), and that there are no multiline values, special characters (e.g. quotes) are non-existent, etc. A parser based on such assumptions reads data line by line, splits each line on the comma character, and… hooray… all parsing done in 2-3 lines of code. Unfortunately, this is an awful simplification that leads to corrupted imports, and bad packages deployed all over the place.
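
To see how quickly the naive split falls apart, compare it with a real CSV parser on a single record containing a quoted comma and an embedded newline (a toy Python illustration; my original tooling here was Perl, and note that the quoting logic inside csv.reader is itself a small state machine):

```python
import csv
import io

record = '1,"Smith, John","line one\nline two",3'

# Naive: split on commas. The comma inside the quoted name breaks
# the field boundaries.
naive = record.split(",")
print(len(naive))  # 5 pieces instead of 4 fields

# Proper: a real parser tracks quoting state across characters (and
# even across physical lines), so it recovers the intended 4 fields.
proper = next(csv.reader(io.StringIO(record)))
print(proper)  # ['1', 'Smith, John', 'line one\nline two', '3']
```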

I remember, 10 years ago or so, trying to read Autorunsc dumps en masse. Every forensic consultant at our company was coming back from their gigs bringing tons of various data dumps, including autorunsc CSV files from each system they live-forensicated.

Spotting an obvious data mining opportunity (after collecting so many dumps from so many gigs), I was trying to build a magic unicorn tool that would detect bad stuff on the spot, where applicable. My idea followed this train of thought: a gig consultant collects autorunsc and other data dumps –> runs my script while still on site –> if we get lucky, detects badness on site –> customer super happy. Today this approach is referred to as LFO (Least Frequency of Occurrence); I also used whitelisting and blacklisting, but those have been known like, forever. These were the early days of IR and Light IR, and other than GNU tools and some basic scripts there was no DB backend to process this data like we do today….

Anyway….

Excited about the prospects, I went out to find a quick way to read CSV files. After looking at and testing many CSV parsing Perl packages available at the time, I couldn’t get any one of them to parse every single data file I had in a reliable way. Many files were parsed properly, but a lot would leave stuff behind that was not digestible by any parser. Either Autorunsc was saving data in an incorrect CSV format, or the Perl CSV packages were bad. As far as I remember, the blame was shared between the two, but knowing this didn’t help my case at all….

The enlightenment came from reading the actual CSV specification. When you skim through it, you quickly realize two things:

  1. No one ever reads stuff like this anymore 😉
  2. It’s unlikely that anyone covers all angles while saving files in this format

The result is that we have many badly implemented CSV parsers out there. You also realize why: this format is NOT as simple as many people think. Quite the opposite; even today, after so many years, even Excel (which actually allows a lot of margin for error!) still fails to process some of these files correctly…

In the end, I wrote my own CSV-parsing state machine that worked for my repo of autorunsc files, but would surely fail for CSV files produced by other programs. Yes, I wrote yet another imperfect, and really bad, CSV parser.

I bring up this CSV anecdote to talk about state machines for a reason.

When you do data processing, you need to know many basic tricks of the trade: Excel, GNU tools, then things a bit more complex like databases, or aggregation systems that allow processing of large volumes of logs, but…. finally… you really want to know how to… actually program.

Programming a state machine is fun. You start with a state equal to ‘I have no clue’. You typically assign some value to a ‘state’ variable that will hold the current state of the machine, and then start parsing data. You can parse character-by-character, word-by-word, or sentence-by-sentence; anything, really. More advanced tokenizers that follow grammar rules can be used too (and actually are, for programming languages, JSON, XML, etc.). Anytime you read something in, you determine whether the state of the machine needs to change. And when it reaches the state where you are ready to output something, you simply… do it…

The code below is a simple example of a state machine. We start with the state set to 0 ($q=0). We read input line by line. As soon as we encounter 4 hexadecimal byte values in a row at the start of an input line, we change the state to $q=1 and preserve these 4 hex values in the $first_line variable.

The next time we read a line, we are in state 1, and this time we only check whether the currently read line is an empty string. If it is, we have reached the end of a hexadecimal block, so we print out the 4 ‘memorized’ hexadecimal values preserved in the $first_line variable. We then return the state to 0, which literally means we start processing input data as if we had started from the top of the file. Anytime we encounter hexadecimal values (at least 4 of them), we start looking for the end of that hexadecimal block.

 use strict;
 use warnings;

 my $q = 0;            # state: 0 = looking for a hex block, 1 = inside one
 my $first_line = '';  # remembered first line of the current block

 while (<>)
 {
   s/[\r\n]+//g;       # strip line endings
   if ($q == 0)
   {
     # four hex byte values at the start of the line open a new block
     if (/^(([0-9a-f]{2} ){4})/i)
     {
       $q = 1;
       $first_line = $1;
     }
   }
   elsif ($q == 1)
   {
     # an empty line closes the block; print the remembered first line
     if (/^\s*$/)
     {
       print "$first_line\n";
       $first_line = '';
       $q = 0;
     }
   }
 }

This may look complex, but when you start ’emulating’ in your head what happens to a file being read, you will soon realize that what this state machine does is simple: it looks for the first line of each hexadecimal dump in a file that stores many hexadecimal dumps, and prints out that first line for each section. It’s pretty hard to do with regular tools, but a simple script that follows the principle of a state machine solves it in no time.

Parsing with regular expressions works on the boundary of a chunk of data that is predictable and systematic, BUT it DOESN’T NEED TO KNOW what happened before (I ignore backreferences here). A manual, self-implemented state machine, on the other hand, allows you to record what came before and act upon it.
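
To make the contrast concrete, the Perl scanner above ports naturally to Python (a sketch with the same two states and the same regexes):

```python
import re

def first_lines_of_hex_blocks(text):
    """Collect the first line of every hex-dump block in text.
    A block starts at a line opening with four hex byte values and
    ends at an empty line (mirroring the Perl script above)."""
    state, first_line, out = 0, "", []
    for line in text.splitlines():
        if state == 0:
            # state 0: hunting for four hex byte values at line start
            m = re.match(r"(([0-9a-f]{2} ){4})", line, re.I)
            if m:
                state, first_line = 1, m.group(1)
        else:
            # state 1: waiting for the empty line that closes the block
            if re.match(r"^\s*$", line):
                out.append(first_line)
                state, first_line = 0, ""
    return out
```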

Anything that breaks a typical regex parsing pattern usually needs to be parsed with something more complex. Something with a MEMORY. A state machine is just one of the algorithms you can use, but there are many others. In some cases machine learning could help; in others you may need parsing that walks through many phases, recursion, a few iterative rounds, etc.

I am not a data scientist, but anytime I write a successful state machine, I certainly feel like one. The ability to use some quick & dirty, ad hoc programming to deal with unstructured, or less predictable, data is a great asset in your toolkit…