When you look at large corpora of appended data (the data that is stored in many PE files but is not mapped into memory when the PE image is loaded at program start), patterns emerge.
For malware, this usually means an abuse of a popular installer.
For goodware, it’s business as usual.
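To make "appended data" concrete, here is a minimal sketch of how it can be located: take the end of the last section's raw data and treat everything after it as appended. This is not the script used for this post; the function name `overlay_offset` is made up for illustration, and the approach is deliberately naive (for example, an Authenticode certificate table also lives past the last section and would be counted as appended data here).

```python
import struct
from typing import Optional


def overlay_offset(data: bytes) -> Optional[int]:
    """Return the file offset where appended data starts, or None if there is none."""
    try:
        if len(data) < 0x40 or data[:2] != b"MZ":
            return None
        (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
        if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
            return None
        # COFF header follows the 4-byte signature.
        (num_sections,) = struct.unpack_from("<H", data, e_lfanew + 6)
        (opt_size,) = struct.unpack_from("<H", data, e_lfanew + 20)
        section_table = e_lfanew + 24 + opt_size
        end = 0
        for i in range(num_sections):
            entry = section_table + i * 40  # each section header is 40 bytes
            # SizeOfRawData and PointerToRawData sit at offsets 16 and 20.
            raw_size, raw_ptr = struct.unpack_from("<II", data, entry + 16)
            if raw_ptr and raw_size:
                end = max(end, raw_ptr + raw_size)
    except struct.error:
        return None  # truncated or malformed headers
    # Naive assumption: anything past the last section's raw data is appended.
    return end if 0 < end < len(data) else None
```

Given the raw bytes of a file in `data`, `data[overlay_offset(data):]` is then the appended blob whenever the function returns an offset.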
Using the state machine script I discussed in my other post today, I extracted the top 4-byte hexadecimal values found at the start of the appended data of many goodware installers.
There are no surprises there: many of the appended data blobs are in a format used by popular and ‘genuine’ installer packages (stub+appended data):
181472 00 00 00 00
131876 4D 53 43 46 - CAB file
36369 2E 66 69 6C - .file
36359 7A 6C 62 1A - Inno Setup
31960 13 00 00 00
27981 3B 21 40 49 - 7z SFX
24883 50 4B 03 04 - Zip
21721 40 55 41 46 - AMI Flash Utility
13896 01 00 00 00
9489 A3 61 4A 6A
9470 5C 73 65 6C - \self\bin\x86\msvcp60.pdb
8021 52 61 72 21 - Rar!
7077 0E 00 00 00
6855 5F 45 4E 5F - _EN_CODE.BIN
The appended data is often a CAB, ZIP, or RAR file, and there are some proprietary appended data formats as well.
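A listing like the one above can be approximated with a simple frequency count. The sketch below is not the state machine script mentioned earlier; it assumes the appended blobs have already been extracted, and the helper names `tally_prefixes` and `format_row` are hypothetical.

```python
from collections import Counter
from typing import Iterable


def tally_prefixes(blobs: Iterable[bytes], n: int = 4) -> Counter:
    """Count how often each n-byte prefix starts an appended data blob."""
    counts: Counter = Counter()
    for blob in blobs:
        if len(blob) >= n:  # skip blobs too short to have a full prefix
            counts[blob[:n]] += 1
    return counts


def format_row(prefix: bytes, count: int) -> str:
    """Render one row as 'count HH HH HH HH', like the listing above."""
    return f"{count} " + " ".join(f"{b:02X}" for b in prefix)


if __name__ == "__main__":
    # Tiny made-up sample, only to show the output shape.
    sample = [b"MSCF\x00\x00", b"MSCFabcd", b"PK\x03\x04zz"]
    for prefix, count in tally_prefixes(sample).most_common():
        print(format_row(prefix, count))
```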
How can we utilize this from a detection perspective?
Formats that are not popular among malware samples could become exclusions.
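As a rough sketch of that idea, the labeled prefixes from the listing can be turned into a lookup table. Which formats actually deserve to be exclusions depends on your own malware corpus, so the `exclusions` set is left as a parameter; `label_overlay` is a hypothetical helper, not part of any existing tool.

```python
# Labels copied from the frequency listing in this post; unlabeled values are omitted.
KNOWN_PREFIXES = {
    bytes.fromhex("4D534346"): "CAB file",
    bytes.fromhex("7A6C621A"): "Inno Setup",
    bytes.fromhex("3B214049"): "7z SFX",
    bytes.fromhex("504B0304"): "Zip",
    bytes.fromhex("40554146"): "AMI Flash Utility",
    bytes.fromhex("52617221"): "Rar!",
}


def label_overlay(blob: bytes, exclusions=frozenset()):
    """Return (label, is_exclusion_candidate) for an appended data blob.

    `exclusions` is a set of labels you decided are rare among malware;
    this post does not say which ones those are.
    """
    label = KNOWN_PREFIXES.get(blob[:4])
    return label, label in exclusions
```

An unknown prefix yields `(None, False)`, so anything not in the table simply falls through to whatever detection logic runs next.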
Outliers are a perfect test bed for any PE parser. Yes… Does your parser parse every PE file structure properly? While analyzing data for this blog post I spotted many badly parsed PE files. This is quite a slap in the face. My parser has grown organically over many years and I was quite confident that it ‘handles’ many outliers. I know now that I have to improve it. A humble lesson for any sample collector really…
Finally, knowing what types of installers are used by goodware, you can use it as a hint on how to craft your red team tools so they do not stand out. It may sound silly, but if ‘next gen’/AI/ML algos really exist and they train on crazily large corpora of samples… chances are that they will learn to ignore many of these popular file setups…