Just updated 3R to include the latest snapshot from https://github.com/keydet89/RegRipper2.8.
We are so used to ‘seeing’ Unicode strings as sequences of 2-byte characters that we often forget this is not actually true. Using 2 bytes is just a convenient way to represent most of the common characters, but the standard also defines characters that fall outside that 16-bit range. To represent them, UTF-16 uses so-called high and low surrogates:
As per https://en.wikipedia.org/wiki/Universal_Character_Set_characters:
Surrogates. The UCS includes 2,048 code points in the Basic Multilingual Plane (BMP) for surrogate code point pairs. Together these surrogates allow any code point in the sixteen other planes to be addressed by using two surrogate code points. This provides a simple built-in method for encoding the 20.1 bit UCS within a 16 bit encoding such as UTF-16. In this way UTF-16 can represent any character within the BMP with a single 16-bit byte. Characters outside the BMP are then encoded using two 16-bit bytes (4 octets total) using the surrogate pairs.
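The arithmetic behind a surrogate pair is short enough to sketch in a few lines of Python. This is only an illustration of the standard UTF-16 formula; the function names to_surrogates and from_surrogates are made up for this example:

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = code_point - 0x10000      # 20-bit value left after removing the BMP
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F30E)
print(f"{high:04X} {low:04X}")  # D83C DF0E
```

So a single character such as U+1F30E ends up as two 16-bit units, D83C and DF0E, which is exactly what a tool assuming one character per 16 bits will trip over.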
Why am I writing about it?
I just stumbled upon a Registry key in Windows 10 whose name uses 4-byte-long Unicode characters ;):
HKEY_USERS\.DEFAULT\Control Panel\International\🌎🌏🌍
It looks like a gimmick, and someone probably had a bit of fun implementing it, but this is actually a legitimate entry being queried when Windows starts!
The characters are (by their binary representation):
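If you want to reproduce the breakdown yourself, here is a minimal Python sketch that dumps the code point and the UTF-16LE bytes (the encoding the Registry uses for key names) of each character. The helper name describe is just something I picked for this example:

```python
import unicodedata

# The key name from above, spelled out as explicit code points
key_name = "\U0001F30E\U0001F30F\U0001F30D"


def describe(text):
    """Return (code point, Unicode name, UTF-16LE bytes) for each character."""
    rows = []
    for ch in text:
        rows.append((
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),
            ch.encode("utf-16-le").hex(" ").upper(),
        ))
    return rows


for code_point, name, raw in describe(key_name):
    print(code_point, name, raw)
```

Each character takes 4 bytes on disk: the shared high surrogate D83C followed by a low surrogate of DF0E, DF0F and DF0D respectively (stored little-endian, so the first one shows up as 3C D8 0E DF).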
I wonder what could be impacted by the “Unicode string=16-bit characters” assumption:
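To make the assumption concrete, here is a small Python sketch (not taken from any real tool) comparing the different ways of "counting" the key name, and showing what happens when a buffer is cut on a 16-bit boundary. The helper decodes_cleanly is a hypothetical name used only here:

```python
key_name = "\U0001F30E\U0001F30F\U0001F30D"
raw = key_name.encode("utf-16-le")

code_points = len(key_name)   # what a user would call "characters"
code_units = len(raw) // 2    # what "length in WCHARs" style logic counts
byte_count = len(raw)         # what is actually stored

print(code_points, code_units, byte_count)  # 3 6 12


def decodes_cleanly(data):
    """True if the bytes are well-formed UTF-16LE (no split surrogate pairs)."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


# Truncating after one 16-bit unit leaves a lone high surrogate behind
print(decodes_cleanly(raw[:2]))  # False
print(decodes_cleanly(raw[:4]))  # True
```

Anything that truncates, indexes, or compares strings per 16-bit unit can end up with half a character, which a strict decoder will then reject.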
Anyway, it’s more of a piece of trivia than anything else…
Note:
If you want to test your tool, run it on a non-Windows 10 OS version; this way you will see whether the app supports such strings both from the analysis perspective (proper parsing of UCS strings) and visually (font).