
Hiding env./tools from malware a.k.a. fight fire with fire (but only inside VM)

November 25, 2012 in Malware Analysis

Seasoned malware analysts/reversers/crackers move along – you already know this stuff :-)

Analyzing malware is always challenging, as there are a few dozen, if not hundreds, of different ways to detect the virtual environment and the other tools reversers use during dynamic or in-depth analysis. Most of these can easily be picked up by malware that looks for process names or registry keys, or that uses one of the undocumented or semi-documented bugs/features of VMs (usually snippets of code that produce different results when executed on a real CPU vs. a virtual CPU).

This short post describes a few ways to hide the VM (with the main focus on VMware) and tools – by hiding their files, processes and services, plus the registry keys/values associated with them.

Changing VM settings

It has been described quite well here.

Hiding Processes only

If you need to hide processes only, you can use HideToolz, available for download from Fyyre’s web site.

When HideToolz is active, the processes marked for hiding are not visible in Task Manager and can’t be found by normal process enumeration functions.

This is what HideToolz sees (processes marked with an asterisk are hidden)

 

And this is what Task Manager can see (Process Explorer as well)

Hiding Files, Folders, Processes, Services, Registry entries

When it comes to hiding more stuff, one can use help from the good ol’ Hacker Defender rootkit by HolyFather.

The rootkit uses a configuration file that lets us specify what we want hidden in the environment, and that includes:

  • files/folders
  • processes
  • services and their associated registry entries
  • registry keys/names/values

To set up Hacker Defender, one needs to change the default configuration file into something along these lines:

[Hidden Table]
hxd*
vmu*
vmt*
vmw*
tools*
procexp*
ollydbg*

[Root Processes]
hxd*
vmu*
vmt*
vmw*
tools*
procexp*
ollydbg*

[Hidden Services]
HackerDefender100
vmu*
vmt*
vmw*
procexp*

[Hidden RegKeys]
VMware, Inc.
Sysinternals

[Hidden RegValues]
vmu*
vmt*
vmw*

[Startup Run]

[Free Space]

[Hidden Ports]

[Settings]
Password=infected
BackdoorShell=cmd.exe
FileMappingName=_.-=[Hacker Defender]=-._
ServiceName=HackerDefender100
ServiceDisplayName=HXD Service 100
ServiceDescription=NT rootkit
DriverName=HackerDefenderDrv100
DriverFileName=hxdefdrv.sys

[Comments]

The new configuration file can now be loaded:

hxdef100.exe hide.ini

From now on, when browsing folders, files, registry keys, names and values, as well as process and service lists, the hidden items will be visible only to processes listed in the ‘Root Processes’ section.
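To get a feel for what a given set of patterns would catch before loading it, the wildcard entries can be tried against a list of names. This is only a rough sketch: it assumes the patterns behave like simple case-insensitive shell-style wildcards, which may not match Hacker Defender's real matching rules exactly, and the `is_hidden` helper and the candidate names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Patterns copied from the [Hidden Table] section of the sample configuration.
HIDDEN_TABLE = ["hxd*", "vmu*", "vmt*", "vmw*", "tools*", "procexp*", "ollydbg*"]


def is_hidden(name, patterns=HIDDEN_TABLE):
    """Return True if name matches any pattern.

    Case-insensitive shell-style matching is an assumption made for this
    sketch, not documented rootkit behaviour.
    """
    lowered = name.lower()
    return any(fnmatchcase(lowered, pattern.lower()) for pattern in patterns)


# Hypothetical names one might find inside an analysis VM.
candidates = ["vmtoolsd.exe", "VMwareTray.exe", "procexp.exe",
              "OLLYDBG.EXE", "explorer.exe", "vboxservice.exe"]

for candidate in candidates:
    state = "hidden" if is_hidden(candidate) else "visible"
    print(f"{candidate:18} {state}")
```

Names that stay "visible" here are the ones that would need an extra pattern in the configuration.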

Example: what Regedit sees before installing the rootkit:

and after its installation

What Task Manager sees before

and after rootkit installation

 

Obviously, the configuration I provided above is far from perfect. The VM-specific strings are all over the place inside the registry, so we need to do a bit more homework. It is also more than likely that your environment uses different paths and tools.

It would be ideal if VM product developers made it possible to completely hide the tools and the environment from the guest OS, e.g. by using simple randomization of names, window titles, process names etc. – a simple technique used for years by many antirootkit tools, e.g. XUETR and GMER.

Top 100+ malicious types of 32-bit PE files

November 19, 2012 in Batch Analysis, Malware Analysis

Another round of stats – this time the top 100+ most ‘popular’ PE i386 file formats used by malware from over 1.2M samples.

Legend:

  • MZ PE i386 = PE 32 bit
  • DLL = DLL :)
  • Corrupted or Tricky = for some reason parser failed (usually some PE file tricks)
  • APPDATA xxxxxxxx = appended data; xxxxxxxx is the hex value of its first 1-4 bytes
  • SIG = contains directory entry pointing to signature (often it’s a random garbage though, not stolen certificates)
  • DEB = contains debugging information
  • COM = COM library
  • .NET = .NET PE
  • and lots of names related to various installers
 (44.17%)    560067    MZ PE i386
  (6.59%)     83554    MZ PE i386 DLL
  (6.16%)     78149    MZ PE i386 Corrupted Tricky
  (4.84%)     61379    MZ PE i386 DEB
  (3.51%)     44529    MZ PE i386 APPDATA 00000000
  (2.99%)     37871    MZ PE i386 SIG
  (2.81%)     35644    MZ PE i386 Tricky
  (2.01%)     25462    MZ PE i386 DLL COM
  (1.30%)     16478    MZ PE i386 NullSoft 2.46-1 SIG
  (1.28%)     16253    MZ PE i386 DLL DEB
  (1.28%)     16220    MZ PE i386 .NET
  (1.04%)     13128    MZ PE i386 SYS
  (0.98%)     12459    MZ PE i386 Tricky SIG
  (0.92%)     11614    MZ PE i386 NullSoft Unknown
  (0.82%)     10393    MZ PE i386 InnoSetup
  (0.78%)      9831    MZ PE i386  AutoIt or AutoHotKey
  (0.77%)      9709    MZ PE i386 Corrupted Tricky DEB
  (0.65%)      8273    MZ PE i386 .NET APPDATA 00000000
  (0.65%)      8217    MZ PE i386 DEB SIG
  (0.64%)      8166    MZ PE i386 NullSoft 2.46
  (0.61%)      7757    MZ PE i386 DLL APPDATA 00000000
  (0.54%)      6881    MZ PE i386 .NET DEB
  (0.48%)      6131    MZ PE i386 Zip Sfx
  (0.48%)      6054    MZ PE i386 Tricky DEB
  (0.47%)      5938    MZ PE i386 Rar SFX
  (0.46%)      5891    MZ PE i386 NullSoft 2.45
  (0.46%)      5836    MZ PE i386 APPDATA B80E0000
  (0.44%)      5631    MZ PE i386 DLL Corrupted Tricky
  (0.42%)      5318    MZ PE i386 Appended MZ
  (0.42%)      5312    MZ PE i386 APPDATA 01000000
  (0.42%)      5279    MZ PE i386 InstallAware
  (0.41%)      5232    MZ PE i386 Tricky DEB SIG
  (0.40%)      5074    MZ PE i386 NullSoft 2.27
  (0.37%)      4733    MZ PE i386 Trymedia
  (0.36%)      4549    MZ PE i386 APPDATA 00000000 DEB
  (0.36%)      4546    MZ PE i386 APPDATA 3C706172
  (0.34%)      4336    MZ PE i386 SYS DEB
  (0.33%)      4161    MZ PE i386 APPDATA A5B79A82
  (0.29%)      3690    MZ PE i386 NullSoft 2.46 SIG
  (0.23%)      2973    MZ PE i386 Trymedia SIG
  (0.23%)      2925    MZ PE i386 APPDATA 88110000
  (0.23%)      2918    MZ PE i386 .file
  (0.22%)      2799    MZ PE i386 Rar SFX DEB
  (0.22%)      2728    MZ PE i386 APPDATA B00E0000
  (0.19%)      2440    MZ PE i386 .NET Tricky
  (0.19%)      2422    MZ PE i386 DLL Tricky
  (0.19%)      2405    MZ PE i386 APPDATA 31353835
  (0.18%)      2255    MZ PE i386 DLL COM APPDATA 00000000
  (0.18%)      2234    MZ PE i386 APPDATA 56566245
  (0.17%)      2206    MZ PE i386 NullSoft 2.46-5 SIG
  (0.16%)      2078    MZ PE i386 APPDATA 08080000
  (0.16%)      2036    MZ PE i386 DLL COM DEB
  (0.16%)      1990    MZ PE i386 .NET DLL DEB
  (0.14%)      1750    MZ PE i386 APPDATA 001F0023
  (0.14%)      1750    MZ PE i386 APPDATA 5B424547 SIG
  (0.13%)      1706    MZ PE i386 DLL SIG
  (0.13%)      1678    MZ PE i386 NullSoft 2.24
  (0.13%)      1633    MZ PE i386 NullSoft 2.44
  (0.13%)      1597    MZ PE i386 DLL APPDATA 928F8C89
  (0.13%)      1585    MZ PE i386 Wise
  (0.12%)      1582    MZ PE i386 DEB
  (0.12%)      1576    MZ PE i386 DLL APPDATA 861DC8F1
  (0.12%)      1545    MZ PE i386 APPDATA 73676567
  (0.12%)      1537    MZ PE i386 APPDATA 50415443
  (0.12%)      1517    MZ PE i386 APPDATA 5A425245
  (0.11%)      1458    MZ PE i386 APPDATA 60170000 DEB
  (0.11%)      1417    MZ PE i386 DLL Corrupted Tricky DEB
  (0.11%)      1374    MZ PE i386 APPDATA 68480000
  (0.11%)      1367    MZ PE i386 NullSoft 25-Apr-2011.cvs
  (0.11%)      1359    MZ PE i386 APPDATA 3C62696E
  (0.10%)      1288    MZ PE i386 APPDATA 88190000
  (0.10%)      1272    MZ PE i386 APPDATA 980E0000
  (0.10%)      1219    MZ PE i386 APPDATA 6BD6EB2C
  (0.10%)      1213    MZ PE i386 InnoSetup SIG
  (0.09%)      1176    MZ PE i386 InstallShield DEB
  (0.09%)      1174    MZ PE i386 APPDATA 680C0000
  (0.09%)      1159    MZ PE i386 CAB SFX (shifted)
  (0.09%)      1137    MZ PE i386 SYS DLL DEB
  (0.09%)      1122    MZ PE i386 APPDATA 90909090
  (0.09%)      1102    MZ PE i386 APPDATA 00A80000 DEB
  (0.09%)      1091    MZ PE i386 APPDATA 05000000
  (0.09%)      1087    MZ PE i386 .NET DLL
  (0.09%)      1082    MZ PE i386 APPDATA 22A72792
  (0.08%)      1048    MZ PE i386 .NET Corrupted Tricky
  (0.08%)      1043    MZ PE i386 APPDATA C26402DF
  (0.08%)       990    MZ PE i386 Rar SFX (shifted) DEB
  (0.07%)       947    MZ PE i386 APPDATA 3C232440
  (0.07%)       903    MZ PE i386 DLL COM Appended MZ
  (0.07%)       896    MZ PE i386 NullSoft 2.14
  (0.07%)       892    MZ PE i386 Rar SFX (shifted)
  (0.07%)       885    MZ PE i386 APPDATA 0D0A0D0A
  (0.07%)       880    MZ PE i386 SYS DLL
  (0.07%)       877    MZ PE i386 NullSoft 01-Jun-2011.cvs SIG
  (0.07%)       874    MZ PE i386 SmartInstallMaker v.5.02
  (0.06%)       808    MZ PE i386 DLL COM SIG
  (0.06%)       807    MZ PE i386 NullSoft 2.37
  (0.06%)       802    MZ PE i386 ADAEBOOK
  (0.06%)       789    MZ PE i386 APPDATA 78766D00
  (0.06%)       764    MZ PE i386 DLL COM
  (0.06%)       737    MZ PE i386 Install Creator
  (0.06%)       719    MZ PE i386 APPDATA 2A2A2A2A
  (0.06%)       715    MZ PE i386 WebCompiler
  (0.06%)       707    MZ PE i386 APPDATA 00
  (0.05%)       693    MZ PE i386 APPDATA 08001700
  (0.05%)       669    MZ PE i386 APPDATA 00000000 SIG
  (0.05%)       665    MZ PE i386 NullSoft 2.24 SIG
  (0.05%)       656    MZ PE i386 APPDATA 31353836
  (0.05%)       651    MZ PE i386 DLL APPDATA 45474645 DEB
  (0.05%)       628    MZ PE i386 DLL DEB SIG
  (0.05%)       622    MZ PE i386 APPDATA 43434343
  (0.05%)       617    MZ PE i386 APPDATA 34120000
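For anyone who wants to slice the list further, the rows are easy to parse. The sketch below assumes the three-column layout shown above (percentage, count, label); the `parse_rows` and `estimate_total` helpers are hypothetical names, and since the percentages are rounded, the total derived from them is only an estimate.

```python
import re

# One row looks like: "(44.17%)    560067    MZ PE i386"
ROW = re.compile(r"^\s*\((\d+\.\d+)%\)\s+(\d+)\s+(.+?)\s*$")


def parse_rows(text):
    """Return a list of (percent, count, label) tuples for every matching line."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            label = " ".join(match.group(3).split())  # collapse repeated spaces
            rows.append((float(match.group(1)), int(match.group(2)), label))
    return rows


def estimate_total(rows):
    """Estimate the collection size from the largest row (approximate only)."""
    percent, count, _ = max(rows, key=lambda row: row[1])
    return round(count / (percent / 100))


SAMPLE = """\
 (44.17%)    560067    MZ PE i386
  (6.59%)     83554    MZ PE i386 DLL
  (0.78%)      9831    MZ PE i386  AutoIt or AutoHotKey
"""

rows = parse_rows(SAMPLE)
print(rows)
print("estimated total samples:", estimate_total(rows))
```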

hstrings (release) – when all strings are attached…

November 18, 2012 in hstrings, Malware Analysis, Software Releases

In a recent post, I introduced a new tool – hstrings. Its purpose is to find strings of any sort: not only ANSI (ASCII really) and the Basic Latin subset of Unicode, but many encoding variants as well. Today I am releasing the first version of the tool, and in this post I will provide more information about the currently available options and modes of operation.

First of all, I encourage you to read Microsoft’s page listing Code Page Identifiers (Windows) – this is the list that I used as a foundation for hstrings. The tool goes a bit further: it splits these into multiple families and also tries to split Unicode sets into more manageable chunks. Still, Code Page Identifiers are the best starting point for choosing what strings one wants to search for.

The tool works in multiple modes and requires a few options that decide how the input is processed, how the output is generated, and which encodings are included in the search.

Let’s see a few examples first…

Character Set recognition

Imagine you have a file that is encoded, but you are not sure what character set is being used for encoding and you have no clue what language it may be at all.

The approach one may take to find out more about the file encoding is… a simple brute force which means checking all possible encodings and trying to convert only a small chunk of bytes from the input file to see what happens.

This is how the ‘probing’ mode works in hstrings. Once you select the option, the tool reads the first 32 bytes of the input file, tries to decode them using all the chosen encodings, and sends the results to the standard output or to separate files (depending on the output options discussed later).
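The brute-force idea can be illustrated in a few lines of Python. This is not how hstrings is implemented; it is a minimal sketch that uses Python's built-in codecs, a short hand-picked candidate list, and a naive "count the Cyrillic letters" score that I made up to rank the results (hstrings leaves that judgment to the reader).

```python
PROBE_SIZE = 32  # the post says hstrings probes the first 32 bytes

# A small, arbitrary subset of encodings chosen for this sketch.
CANDIDATES = ["utf-16-be", "utf-16-le", "utf-8", "cp1251", "koi8_r"]


def probe(data, encodings=CANDIDATES, size=PROBE_SIZE):
    """Decode the first `size` bytes with every candidate encoding."""
    chunk = data[:size]
    return {enc: chunk.decode(enc, errors="replace") for enc in encodings}


def cyrillic_score(text):
    """Count characters from the basic Cyrillic block (U+0400..U+04FF)."""
    return sum(1 for ch in text if "\u0400" <= ch <= "\u04ff")


# The sample sentence comes from the Russian test text used for hstrings.
sample = "Нам аутым убяквюэ нолюёжжэ ад.".encode("utf-16-be")

results = probe(sample)
for enc, text in results.items():
    print(f"{enc:10} score={cyrillic_score(text):2} {text!r}")

best = max(results, key=lambda enc: cyrillic_score(results[enc]))
print("best guess:", best)
```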

In the previous article I presented a sample Russian text encoded with various encodings.

If we run hstrings over one of these files

hstrings -qpsC test\russian_u16be.txt > out

we will get the following output:

As we can see, the longest meaningful string was produced by Unicode Cyrillic. Indeed, the file name contains the suffix ‘u16be’, which is how I named the sample file encoded with 16-bit Unicode Big Endian encoding.

We can then try running the same command on the data saved with a different encoding:

hstrings -qpsC test\russian_utf8.txt > out

Of course, this time we are not lucky as the ‘C’ option we used only applies Cyrillic encodings (see option details at the bottom of the post), and the result shows that none of them succeeded:

We can extend the list – and since it’s just an example we can be greedy – by using all encodings (option ‘0’)

hstrings -qps0 test\russian_utf8.txt > out

Browsing through the results, we can see that this time the UTF-8 encoding gives quite a good output

Indeed, my naming convention reveals that it is a Russian text saved using UTF8 encoding.

Certainly, what helps in character set recognition is at least a basic knowledge of what texts in various languages look like; anyone who has seen Russian text before shouldn’t have a problem picking the correct output (encoding) presented in this example, but if you have never seen Cyrillic text before, this can be quite challenging. One way of improving the algorithm that I have in mind is adding some wordlists to additionally recognize known words in a specific language.

Extracting all strings

Once character set recognition has identified the matching encoding, one can simply extract all strings in this encoding from the whole file. You can do it by replacing ‘p’ (probe character set) with ‘d’ (dump strings).

Since we now know that the last file has been encoded with UTF8, we can extract all strings using the ‘8’ option, which means UTF8:

hstrings -qds8 test\russian_utf8.txt > out

The output looks like this:

Due to the number of encodings supported by hstrings, at the moment there is no way to specify a single character set, except for very popular ones such as UTF8; I may add an option for specific code pages/encodings if there is demand.

 OPTIONS

Let’s walk through them one by one

  • GENERAL OPTIONS:
    • -q – quiet (no banner) – basically no copyright information
  • INPUT OPTIONS – dictate whether we read the whole input file or just the first 32 bytes
    • -p – probe first 32 bytes of a file
    • -d – dump strings from the whole file
  • OUTPUT OPTIONS – provide a choice to save the output in a single file (standard output, which one can redirect to a file), or in multiple files (in such case file names will have a ‘h_’ prefix and a code page as a name)
    • -s – dump strings to standard output (redirect to save to a file)
    • -m – dump strings to multiple files (one encoding = one file)
  • ENCODINGS – these are grouped by families
    • -0 – All supported encodings
    • -1 – All Windows ANSI, UTF8, ASCII subset of Uni-LE/Uni-BE
    • -2 – All Windows ANSI encodings
    • -7 – UTF7
    • -8 – UTF8
    • -U – Unicode encodings (except utf8/utf7)
    • -I – All IBM encodings
    • -E – IBM EBCDIC encodings (subset of I)
    • -M – MAC encodings
    • -A – Arabic encodings
    • -C – Cyrillic encodings
    • -H – Hebrew encodings
    • -J – Japanese encodings
    • -K – Korean encodings
    • -Z – Chinese encodings

Final word

This is an experimental tool and it is far from final – I am personally aware of a few bugs and imperfections that I need to address (e.g. Unicode maps are far from perfect and sometimes produce too much output; generally, too much output is still an issue). If you want to test it, feel free; I will appreciate any feedback. Thanks!

Download

You can download the tool here.

Random Stats from 24k drivers – APIs

November 12, 2012 in Batch Analysis, Malware Analysis

Over the last few months I have been publishing various stats pulled out of the malware collection that I am batch analyzing. The purpose of the analysis is not just getting interesting numbers and using them as a nice filler for the blog :-) – all this data is being retrieved to enhance HexDive and for my other projects. Until now, I have been presenting data from a superset of all malicious PE files in the collection. It crossed my mind recently that it would be interesting to focus on a subset of PE files as well, and for starters I picked kernel drivers.

Getting all strings and then cherry-picking system functions out of the samples is relatively quick, as there are not that many of them. The top 100 most popular APIs, sorted by number of occurrences, are presented below:

18431    RtlInitUnicodeString
16625    IofCompleteRequest
16214    ExAllocatePoolWithTag
14783    ZwClose
12899    MmGetSystemRoutineAddress
12002    ZwOpenKey
11911    ObfDereferenceObject
11719    IoCreateDevice
11430    IoGetCurrentProcess
11411    ExFreePool
11395    IoDeleteDevice
11198    RtlAnsiStringToUnicodeString
10969    ZwCreateFile
10895    wcslen
10848    strncmp
10672    strncpy
10585    wcscpy
10195    IoCreateSymbolicLink
10141    swprintf
9957    wcscat
9899    PsCreateSystemThread
9495    MmIsAddressValid
9466    ZwSetValueKey
9112    PsLookupProcessByProcessId
9106    ObReferenceObjectByHandle
8971    PsGetVersion
8630    ZwCreateKey
8600    RtlCopyUnicodeString
8334    KeDelayExecutionThread
7925    RtlCompareUnicodeString
7886    wcsncpy
7861    ZwQueryValueKey
7525    KeTickCount
7135    KeQuerySystemTime
7052    IoRegisterDriverReinitialization
6674    PsSetCreateProcessNotifyRoutine
5968    ExFreePoolWithTag
5671    ZwEnumerateKey
5427    ZwQuerySystemInformation
5414    ZwSetInformationFile
5249    ZwDeleteKey
5072    wcsstr
5017    KeWaitForSingleObject
4922    ZwCreateSection
4855    ZwMapViewOfSection
4757    IoDeleteSymbolicLink
4747    PsTerminateSystemThread
4708    wcschr
4605    wcsrchr
4540    KeServiceDescriptorTable
4226    KeQueryTimeIncrement
4218    ZwUnmapViewOfSection
4070    IoDeviceObjectType
3941    ZwReadFile
3740    KeInitializeEvent
3706    KeInitializeTimer
3562    ObQueryNameString
3538    ZwWriteFile
3522    KeSetEvent
3495    DbgPrint
3470    KeGetCurrentIrql
3381    KeBugCheckEx
3313    ZwQueryInformationFile
3286    ZwOpenFile
3232    IoFreeMdl
3171    RtlInitAnsiString
3043    memcpy
3037    IofCallDriver
2897    memset
2892    RtlFreeUnicodeString
2870    IoAllocateMdl
2629    MmProbeAndLockPages
2461    MmUnlockPages
2349    RtlUnicodeStringToAnsiString
2340    ZwAllocateVirtualMemory
2291    IoFreeIrp
2265    MmMapLockedPagesSpecifyCache
2144    KeGetCurrentThread
2134    KfReleaseSpinLock
2090    RtlFreeAnsiString
2031    KeStackAttachProcess
2025    KfRaiseIrql
2022    KfLowerIrql
1997    IoAllocateIrp
1997    ExAllocatePool
1994    RtlCompareMemory
1967    ExGetPreviousMode
1930    RtlTimeToTimeFields
1918    sprintf
1896    KeUnstackDetachProcess
1884    KfAcquireSpinLock
1870    ZwOpenProcess
1808    PsGetCurrentProcessId
1795    KeReleaseMutex
1747    RtlAppendUnicodeToString
1746    KeInitializeSpinLock
1740    IoCreateFile
1729    ProbeForRead
1727    KeClearEvent
1713    RtlUnwind
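A ranking like this can be approximated by scanning each driver's raw bytes for API-looking names and tallying them. The sketch below is not the script used for the numbers above: the prefix list in the regular expression is a hand-picked assumption (so C runtime names such as wcslen or memcpy are missed), and counting each name once per file is also an assumption about how the stats were produced.

```python
import re
from collections import Counter

# Hand-picked kernel API prefixes; an illustrative assumption, not a full list.
API_NAME = re.compile(
    rb"\b(?:Zw|Rtl|Iof|Io|Ke|Kf|Mm|Obf|Ob|Ps|Ex)[A-Z][A-Za-z0-9]{3,}\b"
)


def count_apis(blobs):
    """Count, for each API-like name, how many blobs (files) contain it."""
    counts = Counter()
    for blob in blobs:
        names = {match.decode("ascii") for match in API_NAME.findall(blob)}
        counts.update(names)  # a set, so each name counts once per file
    return counts


# Two fake "drivers" standing in for real .sys files read with open(path, "rb").
blobs = [
    b"\x00ZwClose\x00IoCreateDevice\x00ZwClose\x00IofCompleteRequest\x00",
    b"junk\x00ZwClose\x00RtlInitUnicodeString\x00",
]

counts = count_apis(blobs)
for name, count in counts.most_common():
    print(f"{count}    {name}")
```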

hstrings – when all strings are attached…

November 5, 2012 in Forensic Analysis, hstrings, Malware Analysis, Software Releases

TL;DR

A new strings tool that attempts to extract localized strings, e.g. French or Chinese, from an input file; see the example below

Intro

Traditional strings utilities are usually limited to ANSI/Unicode-LE/Unicode-BE strings. This is understandable, as these are the most prevalent types of strings that we come across in our daily work. However, many files contain more strings – ones we usually miss because they contain accented letters, which break the typical string extraction algorithms. On top of that, there are a lot of various character encodings out there, which makes it non-trivial to pick up the right bytes in a regular expression or a state machine. One can have accented letters saved as Unicode-LE, Unicode-BE, UTF8, or using one of many legacy encodings, e.g. Windows Code Pages or IBM EBCDIC encodings.

For quite some time I had in mind an idea to write a smarter strings extraction program that would take this localization/encoding mess into account, so even before I released RUStrings I had already been thinking about writing something more generic. In other words, I wanted to write a tool that can extract strings from a file in any well-known encoding and language possible.

As usual – I didn’t know what trouble I was getting myself into when I began :) .

As mentioned earlier, there are many encodings used by various platforms, and the same string of bytes can be… random garbage… or it can represent a string of characters encoded in one of at least 150 possible encodings, including not only legacy encodings but also Unicode. And not Unicode seen as a subset of characters belonging to the ASCII set interleaved with zeros (the ‘simplified Unicode’ that string extraction tools rely on), but Unicode that includes blocks dedicated to specific languages and scripts, e.g. Chinese, Cyrillic, Hangul, etc.

The tool I present below attempts to:

  • read an input file,
  • walk through the file content
  • apply heuristics and find characters encoded as:
    • bytes (ANSI and other legacy character sets)
    • words (Unicode LE, Unicode BE, and DBCS)
    • byte sequences (utf-8, utf-7, MBCS – multibyte encodings e.g. iso-2022-jp (Japanese) , GB18030 Simplified Chinese etc.)
  • it then normalizes these code points to Unicode LE
  • and appends the strings to an output file for a specific encoding
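The steps above can be sketched roughly as follows. This is not the hstrings algorithm (which uses its own state machine-like heuristics in a single pass); it simply leans on Python's codecs, decodes the whole buffer once per encoding, keeps runs of four or more printable characters, and writes them as Unicode LE. The minimum length, the CRLF separator and the `extract_to_files` helper are assumptions; the ‘h_’ file name prefix follows the convention used by the tool.

```python
import os
import re
import tempfile

# A tiny subset of encodings, chosen only for illustration.
ENCODINGS = ["windows-1251", "windows-1253", "utf-8"]

# Runs of 4+ characters that are neither control characters nor U+FFFD
# (the replacement character produced by undecodable bytes).
RUN = re.compile(r"[^\x00-\x1f\x7f-\x9f\ufffd]{4,}")


def extract_to_files(data, out_dir, encodings=ENCODINGS):
    """Decode data once per encoding and write the strings to h_<encoding>."""
    written = {}
    for enc in encodings:
        text = data.decode(enc, errors="replace")
        path = os.path.join(out_dir, "h_" + enc)
        with open(path, "wb") as handle:
            for string in RUN.findall(text):
                # Normalize every hit to Unicode LE (UTF-16-LE), one per line.
                handle.write((string + "\r\n").encode("utf-16-le"))
        written[enc] = path
    return written


data = b"\x00\x01" + "Нам аутым убяквюэ".encode("cp1251") + b"\x00\x02"

with tempfile.TemporaryDirectory() as out_dir:
    for enc, path in extract_to_files(data, out_dir).items():
        with open(path, "rb") as handle:
            print(enc, repr(handle.read().decode("utf-16-le")))
```

Running it shows the core difficulty: the Cyrillic bytes also come out as a perfectly "valid" string under the Greek code page, so every decoder contributes output of its own.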

At this point the program is in the alpha stage, as I am still not sure how to present the output properly. Currently the program generates a lot of output files. Way too many. But it is not trivial to make it simpler.

From a data processing perspective it is actually quite a complex problem. Since bytes can be interpreted in many ways, the program needs to show all the possible strings extracted from a file. The same string of bytes can easily be interpreted as some legacy ANSI code page (actually, almost all of them simultaneously) or as a Chinese multibyte encoding. The program then needs to normalize the output to Unicode, so we have multiple Unicode streams coming out of multiple decoders for the same location in the file. My detection algorithm relies on state machine-like heuristics and it outputs data as it goes through the input. Since the various encoding heuristics are applied at once (one pass through a file), writing everything to a single file may cause race conditions, and streams from various decoders can start interleaving – leading to a mess. So, currently the output goes to different files. I have a few ideas on how to solve this, but each has a trade-off associated with it, so stay tuned :)

Okay, enough babbling and boring theory – let’s look at an example.
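The ambiguity is easy to reproduce. In the sketch below (plain Python codecs, nothing hstrings-specific, with an `interpretations` helper named just for this example) the very same nine bytes decode without complaint under several single-byte code pages, each time giving different but plausible-looking text.

```python
def interpretations(data, encodings):
    """Decode the same bytes with several encodings and return all results."""
    return {enc: data.decode(enc, errors="replace") for enc in encodings}


# Russian text saved with Windows code page 1251.
data = "Нам аутым".encode("cp1251")

results = interpretations(data, ["cp1251", "koi8_r", "iso8859_5", "cp1253", "cp1252"])
for enc, text in results.items():
    print(f"{enc:10} {text}")
```

Only the cp1251 line is the original text; the other decoders have no way of knowing they are wrong, which is why each one needs its own output stream.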

EXAMPLE

First, we need to create a few sample text files that contain some random text in various languages, encoded in many different encodings.

I generated a few nonsensical lorem-ipsum texts with Lorem Ipsum Generator.

Russian

Нам аутым убяквюэ нолюёжжэ ад. Нам граэкы компльыктётюр нэ. Квуй видырэр ёнэрмйщ ку, прё ат фиэрэнт элььэефэнд эррорибуз. Ан нам фэюгаят юлламкорпэр интылльэгэбат. Пэр декам квюаэчтио эа, эним витаэ июварыт вэл экз, эа емпэтюсъ элыктрам шэа. Ед съюммо ыльигэнди мэль, ыам эи кхоро кэтэро зальютатуж, одео нюмквуам мэнтётюм эа квуй.

Chinese

主谷三間機望飼営電時始能快本面一界。約握企曜回金忙出行場説必確天下員週。連芸止嘩健集人説火忘冠率庭泉。田位国以供地紹臣同旅百出済理強波。球告続況時心断主別重並行県邦不康。記悪暮投氏性善治地長中消。小作解共供小田民覧花伝聞団点。止都要空性難改大境新真権軽降真細登皇。読道決集房休講員軟渡慎無告書。社風理載当宿竹金来簡月教。

Greek

Ιδ φιμ ιλλυδ αλικυαμ συσιπιθ, ετ ηαβεο σανστυς κυι, θεμπορ λυπταθυμ σομπρεχενσαμ μει αν. Υθροκυε νολυισε νες ετ, αδχυς οφφισιις ινφιδυντ αδ σεα. Συ νες λιβρις θιμεαμ. Φιξ μαζιμ λυπταθυμ δελισαθισιμι υθ. Περ υθ πωσε μυνερε.

Luxembourgish

As Fläiß ménger Stieren dat. An och sinn Stret gewalteg, wär am gutt d’Land hinnen, wäit eraus ménger si dee. Feld löschteg mä gei. Fu sou deser Riesen, Blummen löschteg hun jo.

I then saved these files with different encodings:

  • Russian: 1251, koi8-R, Unicode-BE, Unicode-LE, UTF8
  • Chinese: utf8, GB2312, GB18030
  • Greek: Unicode-BE, 1253
  • Luxembourgish: 1252, Unicode-LE

Once done, I combined all of the files into one large file – now the sample file contains multiple texts in multiple different languages saved in multiple different character encodings:
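A similar test file can be produced with a short script. The sketch below is a guess at the procedure rather than what was actually done: it uses shortened excerpts of the texts, Python's names for the encodings listed above, a newline as the separator between parts, and `errors="replace"` because some characters of the ‘Chinese’ sample may not exist in GB2312.

```python
# Excerpts of the sample texts, mapped to the encodings listed above.
SAMPLES = {
    "russian": ("Нам аутым убяквюэ нолюёжжэ ад.",
                ["cp1251", "koi8_r", "utf-16-be", "utf-16-le", "utf-8"]),
    "chinese": ("主谷三間機望飼営電時始能快本面一界。",
                ["utf-8", "gb2312", "gb18030"]),
    "greek": ("Ιδ φιμ ιλλυδ αλικυαμ συσιπιθ",
              ["utf-16-be", "cp1253"]),
    "luxembourgish": ("As Fläiß ménger Stieren dat.",
                      ["cp1252", "utf-16-le"]),
}


def build_parts(samples=SAMPLES):
    """Encode every text with each of its encodings; return the byte strings."""
    parts = []
    for text, encodings in samples.values():
        for enc in encodings:
            # 'replace' avoids errors for characters an encoding cannot store.
            parts.append(text.encode(enc, errors="replace"))
    return parts


parts = build_parts()
combined = b"\n".join(parts)  # the separator is an arbitrary choice
print(len(parts), "parts,", len(combined), "bytes in total")
```

Writing `combined` to disk with `open(path, "wb")` gives a single file that mixes all of the languages and encodings.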

Running hstrings over the file produces multiple output files:

Yes, it’s quite a lot, and reviewing them all is overkill at the moment; I have already mentioned that I am still thinking about how to improve the presentation layer :-)

The rule of thumb is to start with Windows ANSI code pages, UTF8, Unicode-LE (ULE*) and Unicode-BE (UBE*), and of course cheat – we can go ahead and look at the files associated with the encodings we used in the example above, i.e. Russian, Greek, etc. After all, it’s just an example :)

Previewing the result files gives us the following:

  • h_GB18030,GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)

  • h_windows-1253,ANSI Greek; Greek (Windows)

  • h_windows-1251,ANSI Cyrillic; Cyrillic (Windows)

  • h_windows-1252,ANSI Latin 1; Western European (Windows)

So, it would seem that it works…

 

I will be releasing the first version of hstrings soon.

Thanks for reading!