Hexacorn

In a recent post, I introduced a new tool – hstrings. Its purpose is to find strings of any sort: not only ANSI (ASCII really) and a Basic Latin subset of Unicode, but many encoding variants as well. Today I am releasing the first version of the tool, and in this post I will provide more information about the currently available options and modes of operation.

First of all, I encourage you to read Microsoft’s page listing Code Page Identifiers (Windows), which I used as a foundation for hstrings. The tool goes a bit further: it splits these into multiple families and also tries to split Unicode sets into more manageable chunks. Still, Code Page Identifiers are the best starting point for choosing which strings to search for.

The tool works in multiple modes and requires a few options that decide how the input is processed, how the output is generated, and which encodings are included in the search.

Let’s see a few examples first…

Character Set recognition

Imagine you have a file that is encoded, but you are not sure what character set is being used for encoding and you have no clue what language it may be at all.

The approach one may take to find out more about the file encoding is… a simple brute force which means checking all possible encodings and trying to convert only a small chunk of bytes from the input file to see what happens.

This is how the ‘probing’ mode works in hstrings. Once you select this option, the tool reads the first 32 bytes of the input file, tries to decode them using all the chosen encodings, and sends the results to the standard output or to separate files (depending on the output options discussed later).
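The brute-force idea can be sketched in a few lines of Python. This is only an illustration of the approach, not how hstrings is implemented; the candidate list uses Python codec names that I picked for the example, and undecodable bytes are replaced with U+FFFD so that a 32-byte cut in the middle of a multi-byte character does not abort the attempt.

```python
from typing import Dict, Iterable

# Hypothetical candidate list: Python codec names, not hstrings options.
CANDIDATES = ["utf-8", "utf-16-le", "utf-16-be", "cp1251", "koi8-r", "cp866", "iso8859-5"]


def probe(data: bytes, encodings: Iterable[str] = CANDIDATES, size: int = 32) -> Dict[str, str]:
    """Decode the first `size` bytes of `data` with every candidate encoding."""
    chunk = data[:size]
    results = {}
    for name in encodings:
        # errors="replace" keeps going when a byte sequence is invalid
        # (for example when the chunk ends in the middle of a character).
        results[name] = chunk.decode(name, errors="replace")
    return results


if __name__ == "__main__":
    sample = "Привет, мир! Это тест.".encode("utf-16-be")
    for name, text in probe(sample).items():
        print(f"{name:10} {text!r}")
```

A human still has to look at the results and decide which line reads like real text; the sketch only automates the decoding attempts.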

In the previous article I presented a sample Russian text encoded with various encodings.

If we run hstrings over one of these files:

hstrings -qpsC test\russian_u16be.txt > out

we will get the following output:

As we can see, the longest meaningful string was produced by Unicode Cyrillic. Indeed, the file name contains suffix ‘u16be’ which is how I named the sample file encoded with a 16-bit Unicode Big Endian encoding.

We can then try running the same command on the data saved with a different encoding:

hstrings -qpsC test\russian_utf8.txt > out

Of course, this time we are not lucky as the ‘C’ option we used only applies Cyrillic encodings (see option details at the bottom of the post), and the result shows that none of them succeeded:

We can extend the list – and since it’s just an example we can be greedy – by using all encodings (option ‘0’):

hstrings -qps0 test\russian_utf8.txt > out

Browsing through the results, we can see that this time the UTF-8 encoding gives quite a good output.

Indeed, my naming convention reveals that it is a Russian text saved using UTF8 encoding.

Certainly, what helps in character set recognition is at least a basic knowledge of what texts in various languages look like. Anyone who has seen Russian text before shouldn’t have a problem picking the correct output (encoding) in this example, but if you have never seen Cyrillic text, this can be quite challenging. One way of improving the algorithm that I have in mind is adding wordlists to additionally recognize known words in a specific language.
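To make the wordlist idea more concrete, here is a rough Python sketch. It is not part of hstrings; the tiny `RUSSIAN_WORDS` set and the function names are made up for the example, and a real wordlist would need thousands of entries per language.

```python
import re
from typing import Dict, Set

# Tiny illustrative wordlist (an assumption for this sketch).
RUSSIAN_WORDS: Set[str] = {"и", "в", "не", "на", "это", "что", "привет", "мир"}


def score(text: str, wordlist: Set[str]) -> int:
    """Count how many words in `text` appear in `wordlist` (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    return sum(1 for word in words if word in wordlist)


def best_candidate(candidates: Dict[str, str], wordlist: Set[str]) -> str:
    """Return the encoding name whose decoded text matches the most known words."""
    return max(candidates, key=lambda name: score(candidates[name], wordlist))


if __name__ == "__main__":
    raw = "Привет, мир! Это тест.".encode("utf-8")
    decoded = {enc: raw.decode(enc, errors="replace") for enc in ("cp1251", "koi8-r", "utf-8")}
    print(best_candidate(decoded, RUSSIAN_WORDS))
```

Scoring by known words only helps when the probed chunk is long enough to contain some; with 32 bytes the score will often be zero for every candidate.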

Extracting all strings

Detecting the matching encoding is only one aspect of character set recognition. Once the encoding is known, one can simply extract all strings in this encoding from the whole file. You can do it by replacing ‘p’ (probe character set) with ‘d’ (dump strings).

Since we now know that the last file has been encoded with UTF8, we can extract all strings using the ‘8’ option, which means UTF8:

hstrings -qds8 test\russian_utf8.txt > out

The output looks like this:

Due to the number of encodings supported by hstrings, at the moment there is no possibility of specifying a single character set, except for very popular ones such as UTF8; I may add an option for specific code pages/encodings if there is demand.
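For readers who want to experiment with the same idea outside the tool, a dump of strings in a single encoding can be approximated in Python as follows. This is a simplified sketch and not the algorithm hstrings uses; the function name and the minimum length of 4 characters are assumptions made for the example.

```python
from typing import List


def dump_strings(data: bytes, encoding: str = "utf-8", min_len: int = 4) -> List[str]:
    """Return runs of at least `min_len` printable characters found in `data`."""
    # Undecodable bytes become U+FFFD, which is treated as a separator below.
    text = data.decode(encoding, errors="replace")
    runs: List[str] = []
    current: List[str] = []
    for ch in text:
        if ch.isprintable() and ch != "\ufffd":
            current.append(ch)
        else:
            if len(current) >= min_len:
                runs.append("".join(current))
            current = []
    # Flush the last run if the data ends with printable characters.
    if len(current) >= min_len:
        runs.append("".join(current))
    return runs


if __name__ == "__main__":
    blob = b"\x00\x01" + "Привет мир".encode("utf-8") + b"\xff\x00ab\x00hello"
    for s in dump_strings(blob):
        print(s)
```

Passing any other Python codec name as `encoding` gives a crude equivalent of dumping strings for one specific code page.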

OPTIONS

Let’s walk through them one by one

GENERAL OPTIONS:
- q – quiet (no banner) – basically no copyright information
INPUT OPTIONS – dictate whether we read the whole input file or just the first 32 bytes
- p – probe the first 32 bytes of a file
- d – dump strings from the whole file
OUTPUT OPTIONS – provide a choice to save the output in a single file (standard output, which one can redirect to a file) or in multiple files (in which case file names will have an ‘h_’ prefix and the code page as the name)
- s – dump strings to standard output (redirect it to save to a file)
- m – dump strings to multiple files (one encoding = one file)
ENCODINGS – these are grouped by families
- 0 – All supported encodings
- 1 – All Windows ANSI, UTF8, ASCII subset of Uni-LE/Uni-BE
- 2 – All Windows ANSI encodings
- 7 – UTF7
- 8 – UTF8
- U – Unicode encodings (except utf8/utf7)
- I – All IBM encodings
- E – IBM EBCDIC encodings (subset of I)
- M – MAC encodings
- A – Arabic encodings
- C – Cyrillic encodings
- H – Hebrew encodings
- J – Japanese encodings
- K – Korean encodings
- Z – Chinese encodings
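To show how a combined option string such as `-qpsC` breaks down into the groups above, here is a small Python model. It is purely illustrative: the `Options` class and `parse_options` function are hypothetical, the options are assumed to be case-sensitive (so ‘m’ and ‘M’ differ), and whether hstrings accepts more than one encoding family at a time is not something this sketch tries to answer.

```python
from dataclasses import dataclass, field
from typing import List

# Encoding family letters copied from the list above.
ENCODING_FAMILIES = {
    "0": "All supported encodings",
    "1": "All Windows ANSI, UTF8, ASCII subset of Uni-LE/Uni-BE",
    "2": "All Windows ANSI encodings",
    "7": "UTF7",
    "8": "UTF8",
    "U": "Unicode encodings (except utf8/utf7)",
    "I": "All IBM encodings",
    "E": "IBM EBCDIC encodings (subset of I)",
    "M": "MAC encodings",
    "A": "Arabic encodings",
    "C": "Cyrillic encodings",
    "H": "Hebrew encodings",
    "J": "Japanese encodings",
    "K": "Korean encodings",
    "Z": "Chinese encodings",
}


@dataclass
class Options:
    quiet: bool = False
    mode: str = ""    # "probe" or "dump"
    output: str = ""  # "stdout" or "files"
    families: List[str] = field(default_factory=list)


def parse_options(arg: str) -> Options:
    """Split an option string like '-qpsC' into its option groups."""
    opts = Options()
    for ch in arg.lstrip("-"):
        if ch == "q":
            opts.quiet = True
        elif ch == "p":
            opts.mode = "probe"
        elif ch == "d":
            opts.mode = "dump"
        elif ch == "s":
            opts.output = "stdout"
        elif ch == "m":
            opts.output = "files"
        elif ch in ENCODING_FAMILIES:
            opts.families.append(ch)
        else:
            raise ValueError(f"unknown option: {ch!r}")
    return opts


if __name__ == "__main__":
    print(parse_options("-qpsC"))
```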

Final word

This is an experimental tool and it is far from final – I am personally aware of a few bugs and imperfections that I need to address (e.g. Unicode maps are far from perfect and sometimes produce too much output; generally, too much output is still an issue). If you want to test it, feel free; I will appreciate any feedback. Thanks!

Download

You can download the tool here.

Over the last few months I have been publishing various stats pulled out of a malware collection that I am batch analyzing. The purpose of the analysis is not just getting interesting numbers and using them as a nice filler for the blog 🙂 – all this data is being retrieved to enhance HexDive and my other projects.

Until now, I have been presenting data from a superset of all malicious PE files in the collection. It crossed my mind recently that it would be interesting to focus on a subset of PE files as well, and for starters I picked kernel drivers.

Getting all strings and then cherry-picking system functions out of the samples is relatively quick, as there are not so many of them. The top 100 most popular APIs, sorted by number of occurrences, are presented below:

18431    RtlInitUnicodeString
16625    IofCompleteRequest
16214    ExAllocatePoolWithTag
14783    ZwClose
12899    MmGetSystemRoutineAddress
12002    ZwOpenKey
11911    ObfDereferenceObject
11719    IoCreateDevice
11430    IoGetCurrentProcess
11411    ExFreePool
11395    IoDeleteDevice
11198    RtlAnsiStringToUnicodeString
10969    ZwCreateFile
10895    wcslen
10848    strncmp
10672    strncpy
10585    wcscpy
10195    IoCreateSymbolicLink
10141    swprintf
9957    wcscat
9899    PsCreateSystemThread
9495    MmIsAddressValid
9466    ZwSetValueKey
9112    PsLookupProcessByProcessId
9106    ObReferenceObjectByHandle
8971    PsGetVersion
8630    ZwCreateKey
8600    RtlCopyUnicodeString
8334    KeDelayExecutionThread
7925    RtlCompareUnicodeString
7886    wcsncpy
7861    ZwQueryValueKey
7525    KeTickCount
7135    KeQuerySystemTime
7052    IoRegisterDriverReinitialization
6674    PsSetCreateProcessNotifyRoutine
5968    ExFreePoolWithTag
5671    ZwEnumerateKey
5427    ZwQuerySystemInformation
5414    ZwSetInformationFile
5249    ZwDeleteKey
5072    wcsstr
5017    KeWaitForSingleObject
4922    ZwCreateSection
4855    ZwMapViewOfSection
4757    IoDeleteSymbolicLink
4747    PsTerminateSystemThread
4708    wcschr
4605    wcsrchr
4540    KeServiceDescriptorTable
4226    KeQueryTimeIncrement
4218    ZwUnmapViewOfSection
4070    IoDeviceObjectType
3941    ZwReadFile
3740    KeInitializeEvent
3706    KeInitializeTimer
3562    ObQueryNameString
3538    ZwWriteFile
3522    KeSetEvent
3495    DbgPrint
3470    KeGetCurrentIrql
3381    KeBugCheckEx
3313    ZwQueryInformationFile
3286    ZwOpenFile
3232    IoFreeMdl
3171    RtlInitAnsiString
3043    memcpy
3037    IofCallDriver
2897    memset
2892    RtlFreeUnicodeString
2870    IoAllocateMdl
2629    MmProbeAndLockPages
2461    MmUnlockPages
2349    RtlUnicodeStringToAnsiString
2340    ZwAllocateVirtualMemory
2291    IoFreeIrp
2265    MmMapLockedPagesSpecifyCache
2144    KeGetCurrentThread
2134    KfReleaseSpinLock
2090    RtlFreeAnsiString
2031    KeStackAttachProcess
2025    KfRaiseIrql
2022    KfLowerIrql
1997    IoAllocateIrp
1997    ExAllocatePool
1994    RtlCompareMemory
1967    ExGetPreviousMode
1930    RtlTimeToTimeFields
1918    sprintf
1896    KeUnstackDetachProcess
1884    KfAcquireSpinLock
1870    ZwOpenProcess
1808    PsGetCurrentProcessId
1795    KeReleaseMutex
1747    RtlAppendUnicodeToString
1746    KeInitializeSpinLock
1740    IoCreateFile
1729    ProbeForRead
1727    KeClearEvent
1713    RtlUnwind
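A tally like the one above can be produced with a short script once the strings have been extracted from each sample. The Python sketch below is an illustration only and not the code used to generate these numbers; the input layout (a dictionary mapping a sample name to its extracted strings), the `known_apis` set, and the choice to count each API at most once per sample are all assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def count_apis(sample_strings: Dict[str, Iterable[str]], known_apis: Set[str]) -> List[Tuple[str, int]]:
    """Return (api, count) pairs sorted from most to least common."""
    counts: Counter = Counter()
    for strings in sample_strings.values():
        # Assumption: an API is counted once per sample, however often it repeats.
        counts.update(set(strings) & known_apis)
    return counts.most_common()


if __name__ == "__main__":
    samples = {
        "a.sys": ["ZwClose", "RtlInitUnicodeString", "junk"],
        "b.sys": ["RtlInitUnicodeString", "RtlInitUnicodeString"],
    }
    known = {"ZwClose", "RtlInitUnicodeString"}
    for api, n in count_apis(samples, known):
        print(f"{n}    {api}")
```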