UTF Encoding for "Ü" returns 3 bytes instead of the "real" unicode

Question

I was playing around with the code from https://stackoverflow.com/a/21575607/2416394 because I have trouble writing proper UTF-8 XML with TinyXML.

I need to encode "LATIN CAPITAL LETTER U WITH DIAERESIS" (Ü) so that it is written correctly to XML and elsewhere.

Here is the code taken from the post above:

std::string codepage_str = "Ü";
int size = MultiByteToWideChar( CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                                codepage_str.length(), nullptr, 0 );
std::wstring utf16_str( size, '\0' );
MultiByteToWideChar( CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                     codepage_str.length(), &utf16_str[ 0 ], size );

int utf8_size = WideCharToMultiByte( CP_UTF8, 0, utf16_str.c_str(),
                                     utf16_str.length(), nullptr, 0,
                                     nullptr, nullptr );
std::string utf8_str( utf8_size, '\0' );
WideCharToMultiByte( CP_UTF8, 0, utf16_str.c_str(),
                     utf16_str.length(), &utf8_str[ 0 ], utf8_size,
                     nullptr, nullptr );

The result is a std::string of size 3 with the following bytes:

-       utf8_str    "UÌˆ"   std::basic_string<char,std::char_traits<char>,std::allocator<char> >
        [size]  0x0000000000000003  unsigned __int64
        [capacity]  0x000000000000000f  unsigned __int64
        [0] 0x55 'U'    char
        [1] 0xcc 'Ì'    char
        [2] 0x88 'ˆ'    char

When I write it to a UTF-8 file, the bytes stay the same (0x55 0xCC 0x88), and Notepad++ shows the proper character Ü.

However, when I add another Ü to the file in Notepad++ and save it again, the newly written Ü is stored as 0xC3 0x9C (which is what I expected in the first place).

I do not understand why I get a 3-byte representation of this character instead of the UTF-8 encoding of the expected Unicode code point U+00DC.

Although Notepad++ displays both correctly, our proprietary system renders 0xC3 0x9C as Ü but breaks on 0x55 0xCC 0x88, rendering it as UÌˆ instead of treating it as the same character.

score 8 · Accepted Answer · edited Sep 22 '16 at 02:03


Unicode is complicated. There are at least two different ways to get Ü:

LATIN CAPITAL LETTER U WITH DIAERESIS is Unicode codepoint U+00DC.
LATIN CAPITAL LETTER U is Unicode codepoint U+0055, and COMBINING DIAERESIS is Unicode codepoint U+0308.

U+00DC and U+0055 U+0308 both display as Ü.

In UTF-8, Unicode codepoint U+00DC is encoded as 0xC3 0x9C, U+0055 is encoded as 0x55, and U+0308 is encoded as 0xCC 0x88.
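To make those byte values concrete, here is a minimal sketch of the UTF-8 encoding rule in portable C++. The function name `encode_utf8` is made up for this illustration, and the sketch assumes its input is a valid Unicode scalar value (no surrogates).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode a single Unicode code point as UTF-8.
// Assumes cp is a valid scalar value (not a surrogate, at most 0x10FFFF).
std::string encode_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, `encode_utf8(0x00DC)` gives the two bytes 0xC3 0x9C, while `encode_utf8(0x0055) + encode_utf8(0x0308)` gives the three bytes 0x55 0xCC 0x88 seen in the question. Both byte sequences are valid UTF-8; they just encode different code point sequences.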

Your proprietary system seems to have a bug.

Edit: to get what you expect, according to the MultiByteToWideChar() documentation, use MB_PRECOMPOSED instead of MB_COMPOSITE.
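As a rough sketch of that change, the conversion from the question could be wrapped as follows. This is Windows-only, the helper name `acp_to_utf8` is made up for illustration, and it assumes the input bytes are in the system ANSI code page (for example Windows-1252, where Ü is the single byte 0xDC).

```cpp
#ifdef _WIN32
#include <windows.h>
#include <string>

// Hypothetical helper: convert a string in the system ANSI code page to UTF-8.
// MB_PRECOMPOSED asks for precomposed characters (U+00DC), whereas
// MB_COMPOSITE yields a base letter plus a combining mark (U+0055 U+0308).
std::string acp_to_utf8(const std::string& in) {
    if (in.empty()) return std::string();
    const int in_len = static_cast<int>(in.size());

    // First call: query the required number of UTF-16 code units.
    int wide_len = MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED,
                                       in.data(), in_len, nullptr, 0);
    if (wide_len <= 0) return std::string();

    // Second call: perform the ANSI -> UTF-16 conversion.
    std::wstring wide(wide_len, L'\0');
    MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED,
                        in.data(), in_len, &wide[0], wide_len);

    // Same two-step pattern for UTF-16 -> UTF-8.
    int utf8_len = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wide_len,
                                       nullptr, 0, nullptr, nullptr);
    if (utf8_len <= 0) return std::string();

    std::string utf8(utf8_len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), wide_len,
                        &utf8[0], utf8_len, nullptr, nullptr);
    return utf8;
}
#endif  // _WIN32
```

On a system whose ANSI code page is Windows-1252, `acp_to_utf8("\xDC")` should then yield the two bytes 0xC3 0x9C rather than 0x55 0xCC 0x88.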

edited Sep 22 '16 at 02:03

Remy Lebeau


answered Sep 20 '16 at 08:16

RemcoGerlich



Additionally, conversions between these different forms are known as [Unicode normalization](http://unicode.org/faq/normalization.html). – user694733 Sep 20 '16 at 08:22

score 3 · Answer 2 · edited Sep 22 '16 at 02:04

While the encoding output is technically correct, you can work around the problem in the proprietary system by normalizing the string to NFC (Normalization Form C) before writing it out.

In NFC form, all characters are first decomposed (for example, if you had codepoint U+00DC for Ü, it would get decomposed to the sequence U+0055 U+0308) and then re-composed to their canonical representation (in your example, as U+00DC).

In the Win32 API, see the NormalizeString() function.
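A minimal sketch of how NormalizeString() might be called on the UTF-16 string before converting it to UTF-8. This is Windows-only, the helper name `to_nfc` is made up for illustration, and error handling is reduced to returning the input unchanged.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <string>
#pragma comment(lib, "Normaliz.lib")  // MSVC-specific; otherwise link Normaliz.lib manually

// Hypothetical helper: return the NFC form of a UTF-16 string, or the input
// unchanged if normalization fails.
std::wstring to_nfc(const std::wstring& in) {
    if (in.empty()) return in;
    const int in_len = static_cast<int>(in.size());

    // With a null destination, NormalizeString returns an estimated buffer size.
    int estimate = NormalizeString(NormalizationC, in.data(), in_len, nullptr, 0);
    if (estimate <= 0) return in;

    std::wstring out(estimate, L'\0');
    int written = NormalizeString(NormalizationC, in.data(), in_len,
                                  &out[0], estimate);
    if (written <= 0) return in;  // e.g. invalid input or buffer too small

    out.resize(written);  // the estimate may exceed the actual length
    return out;
}
#endif  // _WIN32
```

In the question's code, this would be applied to `utf16_str` between the MultiByteToWideChar() and WideCharToMultiByte() calls, so that U+0055 U+0308 is composed to U+00DC before the UTF-8 conversion.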
