In a recent post, I introduced a new tool – hstrings. Its purpose is to find strings of any sort: not only ANSI (ASCII really) and the Basic Latin subset of Unicode, but many encoding variants as well. Today I am releasing the first version of the tool, and in this post I will provide more information about the currently available options and modes of operation.
First of all, I encourage you to read Microsoft’s page listing Code Page Identifiers (Windows) – this is a list that I used as a foundation for hstrings; the tool goes a bit further and splits these into multiple families and also tries to split Unicode sets into more manageable chunks, yet Code Page Identifiers are the best starting point to choose what strings one wants to search.
The tool works in multiple modes and requires a few options that decide how the input is processed, how the output is generated, and which encodings are included in the search.
Let’s see a few examples first…
Character Set recognition
Imagine you have a file that is encoded, but you are not sure what character set is being used for encoding and you have no clue what language it may be at all.
The approach one may take to find out more about the file encoding is… simple brute force: try all candidate encodings on a small chunk of bytes from the input file and see what happens.
This is how the ‘probing’ mode works in hstrings. Once you select the option, the tool will read the first 32 bytes of the input file, try to decode them using all the chosen encodings, and send the results to the standard output or to separate files (depending on the output options discussed later).
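The probing idea can be sketched in a few lines of Python. This is only an illustration of the brute-force approach, not how hstrings is implemented: the `probe` function, the `CANDIDATES` shortlist and the use of Python's built-in codecs are all assumptions made for the example; only the 32-byte probe size comes from the description above.

```python
PROBE_SIZE = 32  # the probing mode reads only the first 32 bytes

# Hypothetical shortlist of Python codec names; hstrings covers far more encodings.
CANDIDATES = ["utf-8", "utf-16-le", "utf-16-be", "cp1251", "koi8_r", "cp866", "iso8859_5"]


def probe(data: bytes, encodings=CANDIDATES) -> dict:
    """Decode the first PROBE_SIZE bytes with every candidate encoding.

    Undecodable bytes become U+FFFD, so a chunk that is cut in the middle
    of a multi-byte character still yields a readable result.
    """
    chunk = data[:PROBE_SIZE]
    return {name: chunk.decode(name, errors="replace") for name in encodings}


if __name__ == "__main__":
    sample = "Привет, мир".encode("utf-16-be")
    for name, text in probe(sample).items():
        print(f"{name:10} {text!r}")
```

A human (or a scoring function) then picks the candidate whose output looks like real text.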
In the previous article I presented a sample Russian text encoded with various encodings.
If we run hstrings over one of these files
hstrings -qpsC test\russian_u16be.txt > out
we will get the following output:
As we can see, the longest meaningful string was produced by Unicode Cyrillic. Indeed, the file name contains the suffix ‘u16be’, which is how I named the sample file encoded with 16-bit Unicode Big Endian.
We can then try running the same command on the data saved with a different encoding:
hstrings -qpsC test\russian_utf8.txt > out
Of course, this time we are not lucky as the ‘C’ option we used only applies Cyrillic encodings (see option details at the bottom of the post), and the result shows that none of them succeeded:
We can extend the list – and since it’s just an example we can be greedy – by using all encodings (option ‘0’)
hstrings -qps0 test\russian_utf8.txt > out
Browsing through results we can see that this time we got the UTF-8 encoding giving quite a good output
Indeed, my naming convention reveals that it is a Russian text saved using UTF8 encoding.
Certainly, what helps in character set recognition is at least a basic knowledge of what texts in various languages look like; anyone who has seen Russian text before shouldn’t have a problem picking the correct output (encoding) presented in this example, but if you have never seen Cyrillic text before, this can be quite challenging. One way of improving the algorithm that I have in mind is adding some wordlists to additionally recognize known words in a specific language.
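To make the wordlist idea a bit more concrete, here is a rough Python sketch. It is not part of hstrings; `RUSSIAN_WORDS`, `wordlist_score` and `best_encoding` are hypothetical names, and the seven-word list is a stand-in for a real dictionary. Each candidate decoding is scored by how many of its tokens are known words, and the highest score wins.

```python
import re

# Hypothetical, tiny wordlist; a real one would hold thousands of words per language.
RUSSIAN_WORDS = {"и", "в", "не", "на", "что", "привет", "мир"}


def wordlist_score(text: str, words=RUSSIAN_WORDS) -> int:
    """Count the tokens of the decoded text that appear in the wordlist."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for token in tokens if token in words)


def best_encoding(data: bytes, encodings):
    """Return (best encoding name, all scores) for the given candidate encodings."""
    scores = {
        name: wordlist_score(data.decode(name, errors="replace"))
        for name in encodings
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    data = "привет мир".encode("utf-8")
    print(best_encoding(data, ["cp1251", "koi8_r", "utf-8"]))
```

With very short inputs or a small wordlist several encodings can tie at zero, so a score like this is better treated as a hint than as a verdict.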
Extracting all strings
Detecting the matching encoding is only one part of character set recognition; once the encoding is known, one can simply extract all strings in that encoding from the whole file. You can do it by replacing ‘p’ (probe character set) with ‘d’ (dump strings).
Since we now know that the last file has been encoded with UTF8, we can extract all strings using the ‘8’ option, which selects UTF8:
hstrings -qds8 test\russian_utf8.txt > out
The output looks like this:
Because of the number of encodings supported by hstrings, at the moment there is no way to select a single character set, except for very popular ones such as UTF8; I may add an option for specific code pages/encodings if there is demand.
Let’s walk through the available options one by one:
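For readers who want to see what dumping strings in one encoding boils down to, below is a minimal Python sketch for the UTF8 case. It is an approximation, not the tool's algorithm: the `utf8_strings` function and the minimum run length of 4 characters are assumptions borrowed from how classic strings utilities usually behave.

```python
def utf8_strings(data: bytes, min_len: int = 4):
    """Yield runs of at least min_len printable characters decoded as UTF-8.

    min_len=4 is an assumed default, not a documented hstrings setting.
    """
    # Invalid byte sequences turn into U+FFFD, which is treated as a separator.
    text = data.decode("utf-8", errors="replace")
    run = []
    for ch in text:
        if ch.isprintable() and ch != "\ufffd":
            run.append(ch)
            continue
        if len(run) >= min_len:
            yield "".join(run)
        run = []
    if len(run) >= min_len:
        yield "".join(run)


if __name__ == "__main__":
    blob = b"\x00\x01" + "Привет".encode("utf-8") + b"\xff\xfe" + b"hello world\n"
    for s in utf8_strings(blob):
        print(s)
```

Short runs are dropped on purpose; without a length threshold, binary data produces a flood of one- and two-character "strings".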
- GENERAL OPTIONS:
- – q – quiet (no banner) – basically no copyright information
- INPUT OPTIONS – dictate whether we read the whole input file or just first 32 bytes
- – p – probe first 32 bytes of a file
- – d – dump strings from the whole file
- OUTPUT OPTIONS – provide a choice to save the output in a single file (standard output, which one can redirect to a file) or in multiple files (in such case file names will have an ‘h_’ prefix and a code page as a name)
- – s – dump strings to standard output (use pipe to save to file)
- – m – dump strings to multiple files (one encoding=one file)
- ENCODINGS – these are grouped by families
- – 0 – All supported encodings
- – 1 – All Windows ANSI, UTF8, ASCII subset of Uni-LE/Uni-BE
- – 2 – All Windows ANSI encodings
- – 7 – UTF7
- – 8 – UTF8
- – U – Unicode encodings (except utf8/utf7)
- – I – All IBM encodings
- – E – IBM EBCDIC encodings (subset of I)
- – M – MAC encodings
- – A – Arabic encodings
- – C – Cyrillic encodings
- – H – Hebrew encodings
- – J – Japanese encodings
- – K – Korean encodings
- – Z – Chinese encodings
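The way these letters combine into a single argument such as `-qpsC` can be illustrated with a small Python sketch. It only mirrors the option list above; `parse_options` and the lookup tables are hypothetical, and the real tool may validate or interpret its arguments differently. Note that the letters are case-sensitive (‘m’ selects multiple output files, ‘M’ selects MAC encodings).

```python
# Lookup tables copied from the option list above.
MODES = {"p": "probe", "d": "dump"}
OUTPUTS = {"s": "stdout", "m": "multiple files"}
FAMILIES = {
    "0": "All supported encodings",
    "1": "All Windows ANSI, UTF8, ASCII subset of Uni-LE/Uni-BE",
    "2": "All Windows ANSI encodings",
    "7": "UTF7",
    "8": "UTF8",
    "U": "Unicode encodings (except utf8/utf7)",
    "I": "All IBM encodings",
    "E": "IBM EBCDIC encodings (subset of I)",
    "M": "MAC encodings",
    "A": "Arabic encodings",
    "C": "Cyrillic encodings",
    "H": "Hebrew encodings",
    "J": "Japanese encodings",
    "K": "Korean encodings",
    "Z": "Chinese encodings",
}


def parse_options(arg: str) -> dict:
    """Split an argument such as '-qpsC' into its documented parts."""
    parsed = {"quiet": False, "mode": None, "output": None, "families": []}
    for ch in arg.lstrip("-"):
        if ch == "q":
            parsed["quiet"] = True
        elif ch in MODES:
            parsed["mode"] = MODES[ch]
        elif ch in OUTPUTS:
            parsed["output"] = OUTPUTS[ch]
        elif ch in FAMILIES:
            parsed["families"].append(FAMILIES[ch])
        else:
            raise ValueError(f"unknown option letter: {ch!r}")
    return parsed


if __name__ == "__main__":
    print(parse_options("-qpsC"))
```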
This is an experimental tool and it is far from final. I am personally aware of a few bugs and imperfections that I need to address (e.g. Unicode maps are far from perfect and sometimes produce too much output; generally too much output is still an issue), but if you want to test it, feel free; I will appreciate any feedback. Thanks!
You can download the tool here.