We have gotten so used to 'seeing' Unicode strings as being made up of 2-byte characters that we often forget this is not actually true. Using 2 bytes is just a convenient way to represent most of the common characters, but the standard also allows characters outside of that 16-bit range. To represent them, UTF-16 uses so-called high and low surrogates:
Surrogates. The UCS includes 2,048 code points in the Basic Multilingual Plane (BMP) for surrogate code point pairs. Together these surrogates allow any code point in the sixteen other planes to be addressed by using two surrogate code points. This provides a simple built-in method for encoding the 20.1 bit UCS within a 16 bit encoding such as UTF-16. In this way UTF-16 can represent any character within the BMP with a single 16-bit byte. Characters outside the BMP are then encoded using two 16-bit bytes (4 octets total) using the surrogate pairs.
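To make the mechanism in the quote concrete, here is a minimal sketch of how a code point outside the BMP is split into a high and a low surrogate. The function name `to_surrogate_pair` and the sample code point (U+1D11E, the musical G clef) are my own picks for illustration; the arithmetic is the standard UTF-16 one.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a non-BMP code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1D11E)
print(f"{high:04X} {low:04X}")  # D834 DD1E
```

Python's own encoder should agree: `chr(0x1D11E).encode("utf-16-be")` yields the same two 16-bit units.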
Why am I writing about it?
I just stumbled upon a Registry key in Windows 10 that uses these 4-byte-long Unicode characters ;):
It looks like a gimmick, and someone probably had a bit of fun implementing it, but this is actually a legitimate entry being queried when Windows starts!
The characters are (by their binary representation):
- 3C D8 0E DF
- 3C D8 0F DF
- 3C D8 0D DF
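If you want to check what these bytes actually are, here is a small sketch that reads them as UTF-16LE surrogate pairs and computes the code points by hand. The helper name `decode_surrogate_pair` is made up for this example; the byte values are the ones listed above.

```python
import unicodedata

# The three 4-byte sequences from the Registry key name (UTF-16LE)
RAW_CHARS = [
    bytes.fromhex("3CD80EDF"),
    bytes.fromhex("3CD80FDF"),
    bytes.fromhex("3CD80DDF"),
]


def decode_surrogate_pair(raw: bytes) -> int:
    """Turn 4 bytes holding a UTF-16LE surrogate pair into a code point."""
    high = int.from_bytes(raw[0:2], "little")
    low = int.from_bytes(raw[2:4], "little")
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


for raw in RAW_CHARS:
    cp = decode_surrogate_pair(raw)
    # Cross-check the manual math against Python's built-in decoder
    assert raw.decode("utf-16-le") == chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(chr(cp), 'UNKNOWN')}")
```

The three sequences come out as U+1F30E, U+1F30F and U+1F30D, i.e. the three 'Earth globe' emoji.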
I wonder what could be impacted by the “Unicode string=16-bit characters” assumption:
- I guess not all tools support UCS properly
- tools that assume Unicode is 16-bit, or that use their own parsers without taking surrogates into account, will mishandle such strings (I am guilty as charged; I often simplify my scripts this way)
- obviously, most 'strings' tools fail on this too (but most of them fail on non-English Unicode strings anyway)
- many fonts don't support characters encoded with surrogates and can't display them (Win10 Consolas does, Win7 Arial Unicode doesn't)
- I noticed that cmd.exe on Win10 can't 'see' these properly and there is no direct way to change the font to Consolas; see below a folder named the same way as the key, as seen in Explorer and in the cmd terminal:
- who knows, maybe malware will start using it too
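To illustrate the "16-bit characters" assumption from the list above, here is a rough sketch comparing a simplified parser that treats every 2 bytes as one character with a decoder that understands surrogates. The function name `naive_ucs2_decode` and the sample buffer (the ASCII text "key" followed by one of the characters above) are made up for the example.

```python
def naive_ucs2_decode(raw: bytes) -> str:
    """Treat every 2 bytes as one character, as a simplified parser would."""
    return "".join(
        chr(int.from_bytes(raw[i:i + 2], "little"))
        for i in range(0, len(raw) - 1, 2)
    )


def is_surrogate(ch: str) -> bool:
    return 0xD800 <= ord(ch) <= 0xDFFF


# Hypothetical sample: "key" plus the first character from the Registry key name
sample = "key".encode("utf-16-le") + bytes.fromhex("3CD80EDF")

naive = naive_ucs2_decode(sample)
proper = sample.decode("utf-16-le")

# The naive parser reports one character too many, and two of them are
# lone surrogates that mean nothing on their own
print(len(naive), len(proper))                        # 5 4
print([f"{ord(c):04X}" for c in naive if is_surrogate(c)])
print(f"U+{ord(proper[-1]):04X}")                     # U+1F30E
```

A tool built like the naive version will typically either drop the string, cut it short at the first surrogate, or show two broken glyphs instead of one character.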
Anyway, it's more trivia than anything else…
If you want to test your tool, run it on a non-Windows 10 OS version; this way you will see whether the app supports these characters both from the analysis perspective (proper parsing of UCS strings) and visually (font).