File Formats ZOO | Hexacorn

This series talks about ‘good’ files. That is, files (samples) produced by reputable vendors, often signed, and hopefully not compromised by stolen certificates, vulnerabilities, supply-chain attacks or bothered by other err… minor inconveniences :-).

Say you have amassed a bunch of ‘good’ files and declare your first ‘goodware’ collection as ‘ready for processing’.

What do you do with it?

The easiest way to start processing this sampleset is by applying advanced file typing first. We want to know what files we collected, and mind you, I mean not only a distinction between your random media file (jpeg, gif) and PE/ELF/Mach-O, but also binary vs. scripting, 32- vs. 64-bit, EXE vs. DLL, user mode vs. kernel mode, standalone exe vs. .NET, signed vs. unsigned, standalone vs. installer, and installer/packer/protector types: autoit, pyinstaller, perl2exe, legacy PE file protection layers (mpress, pecompact, themida, etc.), MSI, and a gazillion existing installer types that typically store the installation information as compressed/encrypted data appended behind the generic executable installer stub (Nullsoft, InnoSetup, Wise, etc.).
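To make a few of these distinctions concrete, here is a minimal sketch that reads them straight from the PE headers. It is not the script mentioned in this post, only an illustration: the function name `describe_pe` and the keys of the returned dictionary are made up for this example, and the 'signed' check only notices an embedded certificate table. It does not validate anything and will miss catalog-signed files.

```python
import struct
from typing import Optional

def describe_pe(data: bytes) -> Optional[dict]:
    """Return a few coarse traits of a PE image, or None if it is not one."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    (pe_off,) = struct.unpack_from("<I", data, 0x3C)      # e_lfanew
    opt = pe_off + 24                                     # optional header start
    if opt + 2 > len(data) or data[pe_off:pe_off + 4] != b"PE\0\0":
        return None
    (characteristics,) = struct.unpack_from("<H", data, pe_off + 22)
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic not in (0x10B, 0x20B):                       # PE32 / PE32+
        return None
    is64 = magic == 0x20B
    dirs = opt + (112 if is64 else 96)                    # data directory array
    if dirs > len(data):
        return None
    (subsystem,) = struct.unpack_from("<H", data, opt + 68)
    (num_dirs,) = struct.unpack_from("<I", data, dirs - 4)

    def directory(index):
        """(address, size) of a data directory entry, or (0, 0) if absent."""
        entry = dirs + index * 8
        if index >= num_dirs or entry + 8 > len(data):
            return (0, 0)
        return struct.unpack_from("<II", data, entry)

    return {
        "bits": 64 if is64 else 32,
        "kind": "DLL" if characteristics & 0x2000 else "EXE",  # IMAGE_FILE_DLL
        "native_subsystem": subsystem == 1,       # typical for kernel-mode drivers
        "dotnet": directory(14)[0] != 0,          # CLR runtime header present
        "embedded_signature": directory(4)[1] != 0,  # certificate table present
    }
```

Installer, packer and protector detection needs signatures and overlay analysis on top of this, which is where a tool like DiE comes in.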

To perform this task I use a combination of DiE and my own spaghetti-code script that I have been improving over the last 17 years (sorry, not for sharing, it’s absolutely disgusting!).

Once we know what we are dealing with we can try to unpack stuff.

Why unpack you may ask?

Because what you download from vendors’ sites are often installers that internally store many additional files, many of which are OF INTEREST. Yup, additional embedded installers, standalone EXE/DLL/OCX/SYS, redistributables, etc.

How to do it?

7-Zip is a natural candidate, but we need to be careful. I suspect NSRL analysts use it extensively and they unpack everything that has ‘a binary pulse’ recursively until 7z returns an error. As a result, they get tons of executable file sections’ metadata sneaking into their final hash set, and… in the end no one wins.

The other natural candidate is Universal Extractor (and the updated versions of it, e.g. Universal Extractor 2). You can’t win this battle either, because it’s too complex and you lose control of what gets unpacked and what ends up in your final metadata set, same as with an overzealous use of 7z.

We should definitely use these tools though, just need to apply some moderation.

For instance, we can stop 7-Zip from extracting the internals of PE files by using its -stx command line switch:

7z x -stxPE foo.exe

With that, you could run this recursively on all files (including traversing unpacked directories) and get a decent list of internal files, but without unpacking PE files!
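A minimal sketch of wrapping that command in a script follows. The helper names (`build_7z_command`, `extract_one_layer`) are made up for this example, 7z is assumed to be on the PATH, and any non-zero exit code is treated as 'nothing useful came out', which is a simplification. The `run` parameter is only there so the function can be exercised without 7z installed.

```python
import subprocess
from pathlib import Path

def build_7z_command(archive, out_dir):
    """7z extraction command that skips the PE handler (-stxPE, as above)."""
    # -y answers prompts automatically, -o<dir> sets the output directory.
    return ["7z", "x", "-stxPE", "-y", f"-o{out_dir}", str(archive)]

def extract_one_layer(archive, out_dir, run=subprocess.run):
    """Unpack a single layer; return extracted files, or [] on failure."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    result = run(build_7z_command(archive, out_dir), capture_output=True)
    if result.returncode != 0:
        # Assumption: treat warnings and errors alike as "could not unpack".
        return []
    return sorted(p for p in out_dir.rglob("*") if p.is_file())
```

Calling `extract_one_layer` again on each returned file, with a fresh output directory per file, gives the recursive traversal described above.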

For Universal Extractor-based tools, their best use is … analysis of their code. You will find info on both the syntax and the required toolkit that helps to unpack less common archive formats. It’s the best way to study their handling of particular file formats/installers and how to unpack them, so that we can then cherry-pick those that work for us. Again, we don’t want them all, as you will end up unpacking lots of useless information (e.g. .MSI files) and generating a lot of poor quality metadata. Be choosy.

For InnoSetup there is InnoUnp.

For Nullsoft, there is a 7z version 15.05 that extracts Nullsoft installer files very neatly, including the [NSIS].nsi file that is a decently reproduced Nullsoft Installation Script!

For AutoIT executables you can use ClamAV as pointed out by @SmugYeti.

Yup, you have to analyze your advanced file typing results first and then … divide and conquer.

So…

Imagine you have file typed all the samples, you know how to unpack them, what’s next?

I think there are two ways to go about the next step – it starts with a script picking up a single file from your repository, and:

recursively unpack it and its subsequent ‘descendant files’, advanced-file-type them, and at the end copy files of interest (PE, MSI, etc.) back to the repository
unpack the first layer only, then copy files of interest back to the repository for further processing
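The two options differ only in how deep the script goes before handing files back to the repository, so they can be sketched as one loop with a depth limit. Everything named here is hypothetical: `process_sample`, the `unpack` callable (which stands in for whatever unpacker fits the file type), and the extension list, which is a crude stand-in for real file typing.

```python
from collections import deque

# Assumption: extensions stand in for proper file typing of 'files of interest'.
INTERESTING = (".exe", ".dll", ".ocx", ".sys", ".msi")

def process_sample(sample, unpack, max_depth=None):
    """Breadth-first unpacking of one sample.

    unpack(path) must return the list of files extracted from path
    (an empty list if path is not a container).
    max_depth=None -> unpack all descendants (first approach)
    max_depth=1    -> unpack the first layer only (second approach)
    Returns the files of interest to copy back to the repository.
    """
    keep = []
    queue = deque([(sample, 0)])
    while queue:
        path, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # too deep: leave this file for a later pass
        for child in unpack(path):
            if child.lower().endswith(INTERESTING):
                keep.append(child)
            queue.append((child, depth + 1))
    return keep
```

With `max_depth=1` the files copied back simply become new repository entries and get picked up by a later run.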

I think both approaches have advantages, with the first one being probably the ‘smartest’ (i.e. do it once, well), and the second more optimized for resource usage. Why? With source files often being 200MB in size and more (yes, there are plenty of such installers nowadays!), the whole ‘extracted files & directories’ tree may end up being a good few, even a few dozen GB of data! And that simply implies a necessity of using the slow HDD as ‘a working space’.

A side note #1 here: I am talking of small SOHO hardware investments here! Note #2, I also don’t advocate using SSD in your SOHO ‘sample processing’ setup as they tend to fail after too many I/O operations. Had too many issues in the past and don’t recommend.

In the second approach, RAM disk may be good enough most of the time and it’s definitely better, performance-wise.

Choose your poison wisely.

I actually use a hybrid approach – if the file is relatively small I try to unpack it fully on the RAM drive, and if it is an obvious ‘fat installer’ I push it to HDD. And I always try to do ‘a full-recursive’. Saves time and is kinda neat. Again, your personal choice matters here.
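The hybrid choice boils down to a size check. A minimal sketch, where the expansion factor, the directory names and the function name `pick_workspace` are all made-up assumptions to tune for your own setup:

```python
def pick_workspace(size_bytes, ram_free_bytes, expansion=10,
                   ram_dir="R:/work", hdd_dir="D:/work"):
    """Pick a working directory for unpacking one sample.

    expansion is a guess at how much bigger the fully unpacked tree is
    than the installer itself; both paths are placeholders.
    """
    estimated = size_bytes * expansion
    return ram_dir if estimated <= ram_free_bytes else hdd_dir
```

For example, with 1 GB free on the RAM drive, a 5 MB installer stays in RAM while a 200 MB ‘fat installer’ goes to the HDD.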

So… now you extracted, distributed and catalogued all this goodness.

What’s next?

For starters, keep all logs so you can troubleshoot issues, and then… this is a series, so there will be another post 🙂

I like extracting data from many samples because this way I often discover new things. Combing through a set of manifest files I have extracted from a large sampleset of good samples was an interesting exercise and brought a few interesting findings.

Manifest files I came across were saved as plain text, UTF-16 LE, and UTF-8. Some were malformed, some used incorrect data, others included commented-out manifest sections, and sometimes the commented-out parts would use HTML entities to represent opening and closing brackets. I also saw quotation marks mixed with apostrophes, boilerplate values (e.g. name = “CompanyName.ProductName.YourApplication”, name = “YourCompanyName.YourDivision.YourApp”, etc.), and typos (e.g. “schema-microsoft-com:asm.v3”, or “urn:schemas-microsoft.com:asm.v3”).
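Because of the mixed encodings, it helps to normalize every manifest to text before grepping. A minimal sketch, assuming a BOM check first, then a crude ‘every other byte is NUL’ heuristic for BOM-less UTF-16 LE, and cp1252 as the last-resort single-byte code page (that fallback is my guess, not something the sampleset dictates):

```python
import codecs

def decode_manifest(raw: bytes) -> str:
    """Decode manifest bytes to text, trying the encodings seen in the wild."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8", errors="replace")
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le", errors="replace")
    # No BOM: mostly-ASCII UTF-16 LE text has a NUL in every odd position.
    if len(raw) >= 4 and raw[1::2].count(0) > len(raw) // 4:
        return raw.decode("utf-16-le", errors="replace")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: fall back to a Western single-byte code page.
        return raw.decode("cp1252", errors="replace")
```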

I tried to see if I could find any publicKeyToken outliers. These tokens are often used to reference a specific library version, the most popular being comctl32.dll v6.0, which enabled visual styles back in the days when it still mattered (publicKeyToken=“6595b64144ccf1df”).

Quick histogram of publicKeyToken values shows a small number of unique values, some of which are kinda questionable (e.g. empty, zeroed, or using a reference):

publicKeyToken="6595b64144ccf1df"
publicKeyToken="1fc8b3b9a1e18e3b"
publicKeyToken="000000000000000"
publicKeyToken="02ad33b422233ae3"
publicKeyToken="73A0BB510A53FB51"
publicKeyToken="31BF3856AD364E35"
publicKeyToken="0000000000000000"
publicKeyToken="dfbe2673baf698eb"
publicKeyToken="6595B64144CCF1DF"
publicKeyToken="89845dcd8080cc91"
publicKeyToken="13acf979d16e8a17"
publicKeyToken="b03f5f7f11d50a3a"
publicKeyToken="B03F5F7F11D50A3A"
publicKeyToken="$(Build.WindowsPublicKeyToken)"
publicKeyToken="5a496c7842cd4787"
publicKeyToken="296da4bedbebef8f"
publicKeyToken="df38d5d136a3092e"
publicKeyToken=""
publicKeyToken="fcc99ee6193ebbca"
publicKeyToken="b77a5c561934e089"
publicKeyToken="81e233547d425e6b"
publicKeyToken="6bd6b9abf345378f"
publicKeyToken="C7153A0601FA8C89"
publicKeyToken="7a259a25b8d448e5"
publicKeyToken="654bb64156ccf1af"
publicKeyToken="40C4B6FC221F4138"
publicKeyToken="31bf3856ad364e35"
publicKeyToken="1fc8b3b9a1e18e3c"
publicKeyToken="02d1dcd786c7c243"
publicKeyToken="f92d94485545da78"
publicKeyToken="a03853097df2bf0c"
publicKeyToken="A2625990D5DC0167"
publicKeyToken="71E9BCE111E9429C"
publicKeyToken="669E0DDF0BB1AA2A"
publicKeyToken="5120E14C03D0593C"
publicKeyToken="47D0C84D0EBB13E5"
publicKeyToken="4267b751a96a28a1"
publicKeyToken="30AD4FE6B2A6AEED"
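A histogram like this takes only a few lines to build. The sketch below is an illustration, not the exact method used for the list above: it folds upper and lower case together (the list above keeps them apart), and its idea of ‘questionable’ is simply anything that is not 16 hex digits, or that is all zeros. The function names are made up.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r'publicKeyToken\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)

def token_histogram(manifests):
    """Count publicKeyToken values across manifest texts, case-folded."""
    counts = Counter()
    for text in manifests:
        counts.update(value.lower() for value in TOKEN_RE.findall(text))
    return counts

def is_questionable(token):
    """True for empty, zeroed, wrong-length or placeholder tokens."""
    if not re.fullmatch(r"[0-9a-f]{16}", token):
        return True
    return set(token) == {"0"}
```

Usage is along the lines of `token_histogram(texts).most_common()`, followed by filtering the keys with `is_questionable`.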

Another statistic I was interested in was requestedExecutionLevel, but it didn’t bring anything interesting:

level="asInvoker"
level="highestAvailable"
level="leastPrivilege"
level="requireAdministrator"

Looking at processorArchitecture we get:

$(build.processorArchitecture)
*
AMD64
Amd64
IA64
MSIL
SXS_PROCESSOR_ARCHITECTURE
X64
X86
amd64
arm
ia64
msil
x64
x86

For uiAccess:

&quot;false&quot;
FALSE
False
TRUE
True
false
true
true|false
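If you want to compare these values across samples, they need normalizing first. A minimal sketch, assuming that HTML-escaped quotes and case differences should be folded away and that anything else (like the `true|false` template leftover) is best reported as unknown; the function name is hypothetical:

```python
import html

def normalize_ui_access(value):
    """Map a raw uiAccess value to True, False, or None if unrecognized."""
    cleaned = html.unescape(value).strip().strip("\"'").lower()
    if cleaned == "true":
        return True
    if cleaned == "false":
        return False
    return None
```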

Another target of this analysis were URIs. These constantly pop up during memdump analysis and knowing a list of clean ones can save us some time. Here’s a list I extracted (including those prefixed with ‘urn’):

http://blogs.msdn.com/b/chuckw/archive/2013/09/10/manifest-madness.aspx
http://ipmsg.org/tools/fastcopy.html
http://ltsc.ieee.org/xsd/LOM
http://manifests.microsoft.com/win/2004/08/windows/events
http://mozilla.org/MPL/2.0/.
http://msdn.microsoft.com/en-us/library/aa374191
http://msdn.microsoft.com/en-us/library/aa374191(VS.85).aspx
http://msdn.microsoft.com/en-us/library/aa965884%28v=vs.85%29.aspx
http://msdn.microsoft.com/en-us/library/dd371711
http://msdn.microsoft.com/en-us/library/hh848036
http://msdn.microsoft.com/en-us/library/hh848036(v=vs.85).aspx
http://msdn.microsoft.com/en-us/library/ms633543.aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/dn302074%28v=vs.85%29.aspx
http://msdn.microsoft.com/windowsvista/prodinfo/what/security/default.aspx?pull=/library/en-us/dnlong/html/AccProtVista.asp
http://opensource.org/licenses/cpl.php
http://opensource.org/licenses/cpl1.0.php
http://schemas.microsoft.com/SMI/2005/WindowsSettings
http://schemas.microsoft.com/SMI/2010/WindowsSettings
http://schemas.microsoft.com/SMI/2011/WindowsSettings
http://schemas.microsoft.com/SMI/2016/WindowsSettings
http://schemas.microsoft.com/SMI/2017/WindowsSettings
http://schemas.microsoft.com/win/2004/08/events
http://social.msdn.microsoft.com/Forums/en/winformssetup/thread/7787c8b9-18c3-4135-bd8a-2802eba98e3c
http://www.adlnet.org/xsd/adlcp_v1p3
http://www.apache.org/licenses/LICENSE-2.0
http://www.imsglobal.org/xsd/imscp_v1p1
http://www.w3.org/2000/09/xmldsig#
http://www.w3.org/2000/09/xmldsig#sha1
http://www.w3.org/2001/XMLSchema
http://www.w3.org/2001/XMLSchema-instance
http://yourserver/iis_auth.asp?debug=1
urn:0073chemas-microsoft-com:asm.v3
urn:schemas-microsoft-com:asm.v1
urn:schemas-microsoft-com:asm.v2
urn:schemas-microsoft-com:asm.v3
urn:schemas-microsoft-com:clickonce.v1
urn:schemas-microsoft-com:clickonce.v2
urn:schemas-microsoft-com:compatability.v1
urn:schemas-microsoft-com:HashTransforms.Identity
urn:schemas-microsoft-com:HashTransforms.ManifestInvariant
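A list like this can be pulled out with a simple regular expression, and then used as an allowlist when triaging strings from a memory dump. A minimal sketch, where the pattern is deliberately loose and the function names are made up for the example:

```python
import re

# Loose on purpose: grab http(s) and urn: strings up to whitespace or quotes.
URI_RE = re.compile(r"""(?:https?://|urn:)[^\s"'<>]+""", re.IGNORECASE)

def extract_uris(texts):
    """Return the sorted set of URIs found in the given texts."""
    found = set()
    for text in texts:
        found.update(URI_RE.findall(text))
    return sorted(found)

def unknown_uris(texts, known_clean):
    """URIs that are not on the known-clean list."""
    return [uri for uri in extract_uris(texts) if uri not in known_clean]
```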

Finally, attributes (note, some may only exist within comments, that is, between <!-- … --> and not in the actual manifest XML):

name
iid
version
clsid
progid
hash
description
proxyStubClsid32
tlbid
Id
numMethods
publicKeyToken
task
message
language
value
xmlns
processorArchitecture
uiAccess
level
type
class
file
standalone
inType
encoding
mask
flags
manifestVersion
threadingModel
keywords
size
chid
runtimeVersion
guid
xmlns:asmv3
company
optional
outType
helpdir
xmlns:co.v2
copyright
allowDelayedBinding
opcode
xmlns:asmv2
length
xmlns:ms_asmv3
buildType
hashalg
parameters
xmlns:adlcp
xsi:schemaLocation
xmlns:cmp
culture
xmlns:ms_asmv1
profile
xmlns:ms_windowsSettings
xmlns:xsi
baseInterface
majorVersion
face
xmlns:xsd
miscStatusContent
resourceFileName
xmlns:asmv1
isolation
dependencyType
servicePackMajor
xmlns:co.v1
channel
xmlns:lom
assemblyname
xmlns:ms_asmv2
messageFileName
xmlns:ms_compatibility
template
xmlns:mssv2
minorVersion
miscStatus
enabled
asmv2:product
product

And last, but not least… this classic paper [PDF warning] from 2006 on manifest file abuse was yet another reason I looked at manifest files en masse. I speculated that maybe, maybe, maybe, maybe there are some signed executables that take advantage of the manifest’s file tag as described in the document, and that inadvertently may become a vehicle for a ‘by design’ manifest-based DLL side-loading. The scenario would play like this: you run a signed executable that uses a manifest leveraging the file tag, you provide it with a malicious DLL named as the manifest expects, and you place it in the current directory. Should work?

After grepping the manifest files for the <file tag I found quite a few of them. So many that I can’t paste them here. But you can view them here.
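A plain grep will also hit file tags that sit inside commented-out sections, so a slightly more careful pass strips comments first. A minimal sketch using regular expressions rather than an XML parser (on purpose, since some manifests are malformed); the function name is made up:

```python
import re

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
FILE_RE = re.compile(
    r"""<(?:\w+:)?file\b[^>]*?\bname\s*=\s*["']([^"']+)["']""",
    re.IGNORECASE,
)

def referenced_files(manifest_text):
    """Names from <file name="..."> tags outside of XML comments."""
    live = COMMENT_RE.sub("", manifest_text)
    return FILE_RE.findall(live)
```

Each returned name is a candidate for testing the side-loading scenario described above; whether it actually works still has to be verified per executable.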

What’s next? Obviously, more research.


Good file… (What is it good for) Part 2

Re-sauce, Part 3