Extracting and Parsing PE signatures en masse

A few years back I was dealing with a large corpus of PE files, many of them PUA/Adware installers. Most of these were signed, so I thought it would be cool to automate writing yara sigs based on these PE signatures. So I did, and it helped me a lot with dividing the whole sampleset into clusters. I could then just exclude (a.k.a. delete) the uninteresting clusters of installers, and remove them from the scope of my further analysis.

Today someone reminded me of this project, and I thought I would jot down some notes and share the yara sig file I generated at that time. I believe in automation a lot, and hope this will be useful to someone facing similar problems.

To extract the signature from a PE file, one can use disitool.py by Didier Stevens. Once it is extracted, we can analyze it. The problem is that:

  • the extracted signature is in a binary form
  • parsing it is non-trivial, so we need to use existing tools to do so for us

After googling around, I eventually learned how to do it and wrote a simple batch file to delegate this unpleasant task to. The batch file takes the name of a PE file from the command line, extracts the binary signature using disitool.py, and then parses it… in 3 different ways.

This is the batch file:

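@echo off
rem Extract the raw Authenticode blob; %~1 strips any quotes around the argument
disitool.py extract "%~1" "%~1.cert"
if exist "%~1.cert" (
    rem Dump the same blob three ways: generic ASN.1, PKCS#7 certificates, certutil
    openssl asn1parse -inform DER -i -in "%~1.cert" > "%~1.cert.asn"
    openssl pkcs7 -inform DER -in "%~1.cert" -text -print_certs > "%~1.cert.asn2"
    certutil -asn "%~1.cert" > "%~1.cert.asn3"
)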

You may notice that I am using both openssl and certutil. Why double, or even triple, the effort? Because I discovered that relying on data extracted by only one tool was not enough. To be frank, I don’t know the intricate details of what exactly is stored inside an Authenticode signature, and how. ASN.1 is not exactly bedtime reading either, hence I went with an ROI-driven approach and simply extracted the data in every way and format I could.
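For anyone who prefers to drive the same steps from Python, here is a rough sketch of the batch file above as a corpus walker. It is not the script I used back then: the function names, the `.exe`/`.dll` filter, and the assumption that `python`, `disitool.py`, `openssl` and `certutil` can all be found on the PATH are mine. The command-line switches are exactly the ones from the batch file.

```python
import subprocess
from pathlib import Path


def build_commands(pe_path):
    """Return (argv, output_file) pairs mirroring the three parsing steps."""
    cert = f"{pe_path}.cert"
    return [
        (["openssl", "asn1parse", "-inform", "DER", "-i", "-in", cert],
         f"{cert}.asn"),
        (["openssl", "pkcs7", "-inform", "DER", "-in", cert, "-text", "-print_certs"],
         f"{cert}.asn2"),
        (["certutil", "-asn", cert], f"{cert}.asn3"),
    ]


def process_sample(pe_path, disitool="disitool.py"):
    """Extract the signature blob, then dump it with each tool."""
    cert = Path(f"{pe_path}.cert")
    # Assumption: "python" and disitool.py are reachable from the current directory/PATH.
    subprocess.run(["python", disitool, "extract", str(pe_path), str(cert)], check=False)
    if not cert.exists():
        return False  # unsigned file, or the extraction failed
    for argv, out_name in build_commands(pe_path):
        with open(out_name, "wb") as out:
            subprocess.run(argv, stdout=out, check=False)
    return True


def process_corpus(root):
    """Walk a sample directory; the extension filter is only an example."""
    samples = [p for p in Path(root).rglob("*")
               if p.is_file() and p.suffix.lower() in {".exe", ".dll"}]
    for path in samples:
        process_sample(path)
```

Calling `process_corpus(r"D:\samples")` (a made-up path) would leave the `.cert`, `.asn`, `.asn2` and `.asn3` files next to each signed sample, same as the batch file does.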

With that, I ran it over the corpus of samples. I then used a quick & dirty parser I wrote for the output of these two tools, and generated a yara sig file that covered most of the installers in the corpus.
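To give an idea of what such a parser can look like, below is a minimal sketch. It is not my original parser. It assumes the `.asn2` text contains lines of the form `Subject: ..., CN = <name>` (the exact spacing differs between OpenSSL versions), and it emits a simple string-based rule. The regexes, the function names and the `signer_` rule prefix are made up for illustration.

```python
import re

# "Subject:" lines only; "Subject Public Key Info:" and "Issuer:" do not match.
SUBJECT_RE = re.compile(r"^\s*Subject:\s*(.+)$", re.MULTILINE)
# Tolerates both "CN=Name" and "CN = Name"; a CN containing a comma gets cut short.
CN_RE = re.compile(r"(?:^|[,/]\s*)CN\s*=\s*([^,\n]+)")


def extract_common_names(openssl_text):
    """Collect unique subject CNs, in order of appearance.

    Note: the dump covers the whole chain, so CA names show up too
    and would need filtering in real use.
    """
    names = []
    for subject in SUBJECT_RE.findall(openssl_text):
        match = CN_RE.search(subject)
        if match:
            name = match.group(1).strip()
            if name not in names:
                names.append(name)
    return names


def yara_rule_for(name):
    """Build a string-based rule for one signer name (ASCII names only)."""
    ident = "signer_" + re.sub(r"\W", "_", name, flags=re.ASCII)[:100]
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"rule {ident}\n"
        "{\n"
        "    strings:\n"
        f'        $cn = "{escaped}" ascii wide\n'
        "    condition:\n"
        "        uint16(0) == 0x5A4D and $cn\n"
        "}\n"
    )


sample = """Certificate:
    Data:
        Issuer: C = US, O = Example CA, CN = Example Code Signing CA
        Subject: C = US, O = Example Software Ltd, CN = Example Software Ltd
        Subject Public Key Info:
"""

for cn in extract_common_names(sample):
    print(yara_rule_for(cn))
```

Non-ASCII signer names are the painful part: a plain text string will not cut it there, which is exactly the localization issue you run into with real data.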

You can download the Yara Sig file here. Note, I saved it as Unicode, so you can see localization issues one needs to take into account while parsing sigs.

Feel free to use it, but only at your own risk. I don’t guarantee that it’s error-free. Also, if you are listed in the sig file, it’s only for the purposes of sample clustering.

SQM Process Hashes

Today I came across Registry entries that I have not seen documented anywhere before, so I decided to throw together a quick & dirty post about them.

One of the lesser-known and less understood components of Windows is SQM. SQM stands for “Software Quality Metrics”. I don’t really know more about it than what I have read in the linked articles, plus the general opinion online that it is part of the MS spying machine, so pardon my ignorance.

Today, I was looking at artifacts created by various processes and spotted this intriguing entry:

  • HKLM\Software\Microsoft\SQMClient\Windows\DisabledProcesses\<some hash-like looking value>

Knowing that Windows programmers love hashes, I was curious what this entry is for, and obviously, how to calculate the hash it refers to.
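If you want to check whether your own box has any of these entries, a small sketch along these lines should do. I am not sure whether the hash-like names are stored as subkeys or as values under DisabledProcesses, so the sketch lists both. The function name is made up, and the code simply returns an empty list on non-Windows systems or when the key is missing.

```python
try:
    import winreg  # standard library, available on Windows only
except ImportError:
    winreg = None

KEY_PATH = r"Software\Microsoft\SQMClient\Windows\DisabledProcesses"


def list_disabled_process_entries():
    """Return names of subkeys and values found under DisabledProcesses."""
    if winreg is None:
        return []
    try:
        key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH)
    except OSError:
        return []  # key not present, or access denied
    names = []
    with key:
        n_subkeys, n_values, _ = winreg.QueryInfoKey(key)
        names += [winreg.EnumKey(key, i) for i in range(n_subkeys)]
        names += [winreg.EnumValue(key, i)[0] for i in range(n_values)]
    return names


print(list_disabled_process_entries())
```

Keep in mind that a 32-bit Python on 64-bit Windows gets the redirected (WOW64) view of HKLM\Software, so the result may differ from what you see in regedit.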

A quick test followed for a couple of popular programs, and I got these results:

Now that I had a few test values, I looked at the code of ntdll.dll (to which I eventually traced the code responsible for these callouts), and quickly found the routine. The hash type used here is known as UHash (I googled the constants used by the algorithm, and this is the name of the function that I found).

It basically takes the filename of the process (anything that follows the last directory separator) and iterates through it backwards, starting from the end (i.e. from the file extension). Each character is upper-cased (Unicode!) and then fed into the UHash formula.
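The traversal described above can be sketched as follows. Note that this deliberately does not reproduce the UHash mixing step or its constants; those are in the linked script. The `step` and `seed` parameters are hypothetical placeholders, so the sketch only yields the hash values listed in this post if you plug in the real formula. The 32-bit truncation is an assumption based on the 8-hex-digit values, and Python's `upper()` only approximates the Windows upcase table.

```python
import ntpath


def upcase(ch):
    """Upper-case one character, keeping it a single code unit."""
    up = ch.upper()
    return up if len(up) == 1 else ch  # e.g. leave characters like U+00DF alone


def prepared_chars(process_path):
    """Filename only, walked from the end, each character upper-cased."""
    name = ntpath.basename(process_path)  # everything after the last \ or /
    return [upcase(ch) for ch in reversed(name)]


def fold(process_path, step, seed):
    """Feed the prepared characters into a caller-supplied mixing step."""
    h = seed
    for ch in prepared_chars(process_path):
        h = step(h, ord(ch)) & 0xFFFFFFFF  # assumption: 32-bit hash
    return h


# Shows the order in which characters reach the hash: EXE.CLAC
print("".join(prepared_chars(r"C:\Windows\System32\calc.exe")))
```

Separating the traversal from the mixing step makes it easy to test the filename handling on its own before dropping in the actual formula.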

You can see the full algo in a script here.

When run with the example process names from the screenshot above, we get these values:

  • 494A65DD – powershell.exe
  • 4DA42CDB – calc.exe
  • DA0C75C2 – cscript.exe

The more troubling question is the meaning of it all. This, I frankly don’t know. There are a couple of other keys associated with SQM in the same Registry branch, e.g. DisabledSessions (under the same node). Googling around and digging in ntdll.dll suggests that SQM depends on the Customer Experience settings, i.e. the CEIPEnable entry described here:

So, I guess the DisabledProcesses / DisabledSessions entries could be flags that remove _some_ processes from active SQM monitoring (in a more granular way). And all in all, it is something that we probably want to disable completely via the higher-level CEIPEnable value, and others in the same location e.g.: