Re-sauce, Part 1

April 24, 2020 in Archaeology, Clustering, File Formats ZOO, Forensic Analysis

PE Resources are like an unwanted child of malware analysis and reverse engineering. Almost no one talks about them and… this post is going to… make it worse ;).

Let’s take a large number of ‘bad’ samples, export their resource information, and do some data crunching… the result: some stats.
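Counting, by the way, is the easy part. A minimal Perl sketch, assuming the resource type names have already been dumped into a text file, one entry per resource (tally.pl and types.txt are made-up names):

 use strict;
 use warnings;

 # Tally resource type names (one per line) and print the counts, descending
 my %count;
 while (<>)
 {
   s/[\r\n]+//g;
   $count{$_}++ if length;
 }
 printf "%9d   %s\n", $count{$_}, $_
   for sort { $count{$b} <=> $count{$a} } keys %count;

Run perl tally.pl types.txt and you get a list like the one below.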

What are the most popular resources?

These are:

4720830   RT_ICON (3)
4703093   RT_GROUP_ICON (14)
3445748   RT_VERSION (16)
2574034   RT_MANIFEST (24)
2291058   RT_DIALOG (5)
2022739   RT_STRING (6)
1564623   RT_RCDATA (10)
1193659   RT_BITMAP (2)
1159726    'DVCLAL'
1050941    'PACKAGEINFO'
 931572    'MAINICON'
 903265   RT_CURSOR (1)
 884868   RT_GROUP_CURSOR (12)
 557473    'BBABORT'
 551898    'BBALL'
 551836    'BBOK'
 551785    'BBNO'
 551023    'BBRETRY'
 542886    'BBIGNORE'
 542836    'BBHELP'
 542834    'BBCLOSE'
 542593    'BBYES'
 541708    'BBCANCEL'
 498816    'PREVIEWGLYPH'
 497272    'DLGTEMPLATE'
 358081   RT_MENU (4)
 199615    'TFORM1'
 174781   RT_ACCELERATOR (9)

The ones with an RT_ prefix are standard resource types defined by Microsoft (the number in parentheses is the type ID), and the ones in apostrophes are strings that ‘tag’ (or ‘name’) the resources according to the developer’s wishes…
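For reference, the standard type IDs come straight from winuser.h; here is a quick lookup sketch (just the common ones; the hash is easy to extend):

 use strict;
 use warnings;

 # Standard resource type IDs from winuser.h - handy when a tool
 # shows only the raw numbers
 my %rt = (
   1  => 'RT_CURSOR',       2  => 'RT_BITMAP',       3  => 'RT_ICON',
   4  => 'RT_MENU',         5  => 'RT_DIALOG',       6  => 'RT_STRING',
   7  => 'RT_FONTDIR',      8  => 'RT_FONT',         9  => 'RT_ACCELERATOR',
   10 => 'RT_RCDATA',       11 => 'RT_MESSAGETABLE', 12 => 'RT_GROUP_CURSOR',
   14 => 'RT_GROUP_ICON',   16 => 'RT_VERSION',      17 => 'RT_DLGINCLUDE',
   19 => 'RT_PLUGPLAY',     20 => 'RT_VXD',          21 => 'RT_ANICURSOR',
   22 => 'RT_ANIICON',      23 => 'RT_HTML',         24 => 'RT_MANIFEST',
 );

 my $id = shift // 24;
 print $rt{$id} // 'a named or non-standard type', "\n";

Save it as, say, rt.pl (another made-up name) and perl rt.pl 16 prints RT_VERSION.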

Given that a number of these ‘named’ ones are used repeatedly (as shown by the list above), you can guess that they are somehow ‘known’, or part of some ‘standard’ — and yup, these come primarily from the Borland/Delphi/Embarcadero family of executables, which include standard GUI elements from that platform. All the ‘BB*’ and ‘T*’ names come from this environment. Additionally, ‘PACKAGEINFO’ is a resource I covered a little bit in the past – it lists all the packages the executable uses (a good IOC, except no one writes malware in Delphi anymore).

Surprisingly, modern PE viewers and editors do not parse PE resources very well. They only show the most popular resource types, because the others are often… undocumented. I really don’t like looking at resources in a hex view. We can do better.

Let’s start with those that are ‘kinda documented’.

For instance, Resource Hacker can handle some Delphi resources (e.g. forms) pretty well.

A popular ‘Typelib’ resource can be viewed with OleView.

The ‘Registry’ resource is typically an embedded ‘.reg’ file.

A ‘FOMB’ is a binary MOF that was described in this post by FireEye and can be decoded using bmfdec.

What about the others?

And this is where it gets really difficult…

Looking at the resources embedded in Windows 10 exe, dll, and ocx files, one can very quickly build a list of more or less enigmatic-looking resource names:

  • ACCELERATOR
  • ANICURSOR
  • AVI
  • BINARY
  • BITMAP
  • BITMAP4
  • BRANDING_METADATA_RES
  • BRANDING_REQUIRED_RESOURCEID_MAP
  • CERT
  • CODEPAGES
  • CODEPAGESEXT
  • CURSOR
  • DATA_FILE
  • DATAFILERESOURCE
  • DGML
  • DIALOG
  • DUI
  • EDPAUTOPROTECTIONALLOWEDAPPINFOID
  • EDPENLIGHTENEDAPPINFOID
  • EDPPERMISSIVEAPPINFOID
  • EMBEDDEDDATA
  • FILES
  • FLEX_TABLE
  • FLEXDL
  • FONT
  • FONTDIR
  • FONTFALLBACK
  • GIF
  • GROUP_CURSOR
  • GROUP_ICON
  • HTML
  • HWB
  • HWXLANGID
  • IBC
  • ICON
  • IMAGE
  • JPEG
  • JS
  • JSON
  • JSON_RESPONSE
  • MANIFEST
  • MENU
  • MESSAGETABLE
  • MOFDATA
  • MSTESTROOT
  • MUI
  • PNG
  • PNGFILE
  • PRELOAD
  • PRXFILE
  • RCDATA
  • REGINST
  • REGISTRY
  • RGSLIST
  • SCHEMA
  • SIAMDB
  • SKDFILE
  • SRGRAMMAR
  • STYLE_XML
  • TESTROOT
  • TEXT
  • TEXTINCLUDE
  • TUNINGSPACE
  • TYPELIB
  • UIFILE
  • VR_ETW_MANIFEST
  • VR_ETW_RESOURCE
  • VSGEXP
  • WAVE
  • WEVT_TEMPLATE
  • XML
  • XML_FILE
  • XML_SCHEMA
  • XMLFILE
  • XSD
  • XSDFILE
  • XSLFILE

Yup. Some are easy to handle (just by looking at their names, e.g. AVI, BITMAP, XML), but… this is just Windows 10.

Time will tell if we will ever see a PE editor/viewer that can handle all, or at least most, of these well.

In the meantime…

Resources are something you may want to look at more closely. Starting today.

Why?

Because of this guy (an image I got from the resources of Norton SecureWorks, circa 2002-2003). Do you even remember that this software existed?

One of the cool side effects of poking around in many resources is coming across weird, unusual strings, texts, images, movies, you name it. You will find developer pictures that were never meant for the general public, ‘tagging’ images with the names of developers or project managers, jokes, and whatever else. Yes, there is cheesy stuff, there is porn, there are obscenities, and there are also Easter eggs.

If you want to start building your own collection, it couldn’t be easier…

You can simply use:

  • 7z l <filename> .rsrc
    • to list all the resources of <filename>
  • 7z x <filename> .rsrc
    • to extract them.
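To run this over a whole corpus, a quick wrapper does the job. A sketch in Perl; the samples/ and extracted/ directory names are made up, and 7z is assumed to be on the PATH:

 use strict;
 use warnings;

 # Extract the .rsrc section of every sample in samples/
 # into a per-sample folder under extracted/
 for my $file (glob 'samples/*')
 {
   next unless -f $file;
   (my $name = $file) =~ s{.*/}{};      # keep just the file name
   system '7z', 'x', "-oextracted/$name", $file, '.rsrc';
 }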

And then start data crunching:

  • Icons are interesting, especially if re-used for malicious purposes (e.g. Adobe, Microsoft) –> there are existing yara sigs for these!
  • Manifest may include references to other executables/DLLs loaded
  • Manifest may also include references to rights required for running the executable (e.g. look for level=”requireAdministrator”)
  • Language information may be helpful with attribution (beware of false flags)
  • Version information lists lots of interesting details that can be correlated with information extracted from certificates/signatures, if present
  • Delphi resources are fairly well documented and can be extracted, especially the aforementioned package names — these can help to at least cluster samples by the modules used (may sometimes highlight similar families, plus they are good for yara sigs)
  • Everything else should be extracted and checked against typical file types/magic (see the sketch after this list):
    • BMP
    • PNG
    • GIF
    • JPG
    • AVI
    • WAV
    • RTF
    • ICO
    • CUR
    • PE files
    • LE files (Linear Executables, an old extension of the MZ format)
    • MZ files (yup, plain DOS)
    • UTF8/Unicode BOMs
    • Office files
    • etc.
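For that last bullet, a rough first pass can be done with a handful of magic bytes. A minimal sketch, not a replacement for ‘file’, TrID, or proper parsers; extend the hash as needed:

 use strict;
 use warnings;

 # Guess the type of each file given on the command line by its magic bytes
 my %magic = (
   "MZ"           => 'MZ/PE/LE executable',
   "BM"           => 'BMP image',
   "\x89PNG"      => 'PNG image',
   "GIF8"         => 'GIF image',
   "\xFF\xD8\xFF" => 'JPEG image',
   "RIFF"         => 'RIFF container (AVI/WAV)',
   "{\\rtf"       => 'RTF document',
   "\xEF\xBB\xBF" => 'UTF-8 BOM (text)',
   "\xFF\xFE"     => 'UTF-16 LE BOM (text)',
 );

 for my $file (@ARGV)
 {
   open my $fh, '<:raw', $file or next;
   read $fh, my $head, 8;               # 8 bytes is enough for these magics
   close $fh;
   my ($hit) = sort { length $b <=> length $a }   # prefer the longest match
               grep { substr($head, 0, length $_) eq $_ } keys %magic;
   printf "%-40s %s\n", $file, $hit ? $magic{$hit} : 'unknown';
 }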

Resources are a very important metadata source for analysts. If you are lucky you may not only get the visuals, but also timestamps (e.g. in Delphi executables).

Be err… resourceful.

State Machine vs. Regex

September 6, 2019 in Forensic Analysis, How to..., Tips & Tricks

Update

As Codeslack pointed out, regexes themselves are state machines. Of course. The biggest difference between using built-in regexes and your own state machine is that you have full control over the states and can memorize anything you want, including many previous states.

Old post

There are tons of tutorials and web sites dedicated to Regular Expressions. I love Regular Expressions and have been using them all the time, for many years… but… they have so many limitations. This is why you need to not only learn them, but also use them wisely… and you often need to enrich their usage by making them part of a state machine.

What is a state machine?

I will first say what it is useful for: you want to use one to process data that is very hard to process using regular expressions alone, or with other simple data-processing algorithms and tools, and that even more complex ones fail to process in a generic way.

Parsing a Comma-Separated Values (CSV) file is a very simple… yet extremely complex example of data that requires a state machine for proper analysis. Many beginner programmers write parsers for CSV and… fail. Interestingly… often without even noticing. Why? What they write usually works for their use case, and they forget about it when they move to a different project.

A typical naive approach assumes that the CSV data is properly formatted, uses English characters only (ANSI!), and that there are no multiline values, no special characters (e.g. quotes), etc. A parser based on such assumptions reads the data line by line, splits each line on the comma character, and… hooray… all parsing done in 2-3 lines of code. Unfortunately, this is an awful oversimplification that leads to corrupted imports and bad packages deployed all over the place.
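To see the failure mode, here is a minimal sketch; the sample line is made up, and Text::CSV is a CPAN module that does it properly:

 use strict;
 use warnings;
 use Text::CSV;

 my $line = 'foo,"bar, baz",qux';

 # The naive approach: splitting on commas breaks the quoted field
 my @naive = split /,/, $line;    # 4 'fields': foo | "bar | baz" | qux

 # A proper parser handles the quotes
 my $csv = Text::CSV->new({ binary => 1 });
 $csv->parse($line);
 my @fields = $csv->fields();     # 3 fields: foo | bar, baz | qux

 print scalar @naive, " vs ", scalar @fields, " fields\n";  # prints: 4 vs 3 fields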

I remember, 10 years ago or so, trying to read Autorunsc dumps en masse. Every forensic consultant at our company was coming back from their gigs bringing tons of various data dumps, including autorunsc CSV files from each system they live-forensicated.

Spotting an obvious data mining opportunity (after collecting so many dumps from so many gigs), I was trying to build a magic unicorn tool that would detect bad stuff on the spot, where applicable. My idea followed this train of thought: a gig consultant collects autorunsc and other data dumps –> runs my script while still on site –> if we get lucky, detects badness on site –> customer super happy. Today this approach is referred to as LFO (Least Frequency Occurrence); I also used whitelisting and blacklisting, but these have been known, like, forever. These were the early days of IR and Light IR, and other than GNU tools and some basic scripts there was no DB backend to process this data like we do today…
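The core of the LFO idea is embarrassingly small, by the way. A sketch; the directory layout, the chosen column, and the threshold are all made up:

 use strict;
 use warnings;

 # Tally a column across many dumps; the rarest entries are often
 # the most interesting ones
 my %seen;
 for my $dump (glob 'autoruns/*.csv')
 {
   open my $fh, '<', $dump or next;
   while (my $row = <$fh>)
   {
     chomp $row;
     my @f = split /,/, $row;    # yes, the naive split this post warns about
     $seen{ $f[-1] }++ if @f;    # tally e.g. the image path column
   }
   close $fh;
 }
 # anything seen in 3 or fewer dumps bubbles up for a closer look
 print "$seen{$_}\t$_\n" for grep { $seen{$_} <= 3 } sort keys %seen;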

Anyway….

Excited about the prospect, I went out there to see how to read CSV files in a quick way. After looking at and testing the many CSV-parsing Perl packages available at that time, I couldn’t get any one of them to parse every single data file I had in a reliable way. Many files were parsed properly, but a lot would leave stuff behind that was not digestible by any parser. Either Autorunsc was saving data using an incorrect CSV format, or the Perl CSV packages were bad. As far as I remember, the blame was shared between the two, but knowing this didn’t help my case at all…

The enlightenment came from reading the actual CSV specification (RFC 4180). When you skim through it you quickly realize two things:

  1. No one ever reads stuff like this anymore 😉
  2. It’s unlikely that anyone covers all angles while saving files in this format

The result is that we have many badly implemented CSV parsers out there. You also realize why: this format is NOT as simple as many people think. Quite the opposite; even today, after so many years, even Excel (which actually allows a lot of margin for error!) still fails to process some of these files correctly…

In the end, I wrote my own CSV-parsing state machine that worked for my repo of autorunsc files, but would surely fail for CSV files produced by other programs. Yes, I wrote yet another imperfect, and really bad, CSV parser.

I bring up this CSV anecdote for a reason: it is a perfect excuse to talk about state machines.

When you do data processing, you need to know many basic tricks of the trade: Excel, GNU tools, then slightly more complex things like databases or aggregation systems that allow processing of large volumes of logs, but… finally… you really want to know how to… actually program.

Programming a state machine is fun. You start with a state equal to ‘I have no clue’. You typically assign some value to a ‘state’ variable that will hold the current state of the machine, and then start parsing data. You can parse character by character, word by word, or sentence by sentence; anything, really. More advanced tokenizers that follow grammar rules can be used too (and actually are, for programming languages, JSON, XML, etc.). Anytime you read something in, you determine whether the state of the machine needs to change. And when it reaches the state where you are ready to output something, you simply… do it…

The code below is a simple example of a state machine. We start with the state set to 0 ($q=0). We read input line by line. As soon as we encounter 4 hexadecimal values in a row (in the input line), we change the state to $q=1 and preserve these 4 hex values in a $first_line variable.

The next time we read a line, we are in state 1, and this time we only check if the currently read line is an empty string. If it is, it means we have reached the end of a hexadecimal block, so we print out the 4 ‘memorized’ hexadecimal values preserved in the $first_line variable. We then return the state to 0. This literally means we start processing input data as if we had started from the top of the file: anytime we encounter hexadecimal values (at least 4 of them), we start looking for the end of that hexadecimal block.

 use strict;
 use warnings;

 my $q = 0;            # state: 0 = looking for a hex dump, 1 = inside one
 my $first_line = '';  # first 4 hex values of the current dump

 while (<>)
 {
   s/[\r\n]+//g;       # strip line endings
   if ($q == 0)
   {
     # state 0: look for a line that starts with 4 hex byte pairs
     if (/^(([0-9a-f]{2} ){4})/i)
     {
       $q = 1;           # switch to state 1...
       $first_line = $1; # ...and memorize the 4 hex values
     }
   }
   elsif ($q == 1)
   {
     # state 1: look for the empty line that ends the hex block
     if (/^\s*$/)
     {
       print "$first_line\n";  # output the memorized first line
       $first_line = '';
       $q = 0;                 # back to state 0
     }
   }
 }

This may look complex, but when you start ‘emulating’ in your head what happens to a file being read, you will soon realize that what this state machine does is simple: it looks for the first line of each hexadecimal dump in a file that stores many hexadecimal dumps, and then prints out that first line for each section. It’s pretty hard to do with regular tools, but a simple script that follows the principle of a state machine can solve it in no time.
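To see it in action, save the script as, say, first_lines.pl (name made up) and feed it a file like this:

 4d 5a 90 00 03 00 00 00
 50 45 00 00 4c 01 04 00

 89 50 4e 47 0d 0a 1a 0a
 49 48 44 52 00 00 00 10

Running perl first_lines.pl dumps.txt prints the memorized first 4 bytes of each block: 4d 5a 90 00 and 89 50 4e 47. One subtle catch: the machine only prints when it sees an empty line, so if the input does not end with a blank line, the last block’s first line is silently dropped.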

Parsing using regular expressions works on the boundary of a chunk of data that is predictable and systematic BUT DOESN’T NEED TO KNOW what happened before (I ignore backreferences here); on the other hand, a manual, self-implemented state machine allows you to record what came before and act upon it.

Anything that breaks a typical regex parsing pattern usually needs to be parsed with something more complex. Something with a MEMORY. A state machine is just one of the algorithms you can use, but there are many others. In some cases machine learning could help; in others you may need parsing that walks through many phases, recursion, a few iterative rounds, etc.

I am not a data scientist, but anytime I write a successful state machine, I certainly feel like one. The ability to use some quick & dirty, ad hoc programming to deal with unstructured or less predictable data is a great asset in your toolkit…