I was experimenting with the code from https://stackoverflow.com/a/21575607/2416394 because I have trouble writing proper UTF-8 XML with TinyXML.
I need to encode "LATIN CAPITAL LETTER U WITH DIAERESIS" (Ü) so that it is written correctly to XML and similar output.
Here is the code taken from the post above:
// Step 1: convert from the ANSI code page to UTF-16.
std::string codepage_str = "Ü";
int size = MultiByteToWideChar( CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                                codepage_str.length(), nullptr, 0 );
std::wstring utf16_str( size, '\0' );
MultiByteToWideChar( CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                     codepage_str.length(), &utf16_str[ 0 ], size );

// Step 2: convert from UTF-16 to UTF-8.
int utf8_size = WideCharToMultiByte( CP_UTF8, 0, utf16_str.c_str(),
                                     utf16_str.length(), nullptr, 0,
                                     nullptr, nullptr );
std::string utf8_str( utf8_size, '\0' );
WideCharToMultiByte( CP_UTF8, 0, utf16_str.c_str(),
                     utf16_str.length(), &utf8_str[ 0 ], utf8_size,
                     nullptr, nullptr );
The result is a std::string of size 3 with the following bytes:
- utf8_str      "Ü"                   std::basic_string<char,std::char_traits<char>,std::allocator<char> >
    [size]      0x0000000000000003    unsigned __int64
    [capacity]  0x000000000000000f    unsigned __int64
    [0]         0x55 'U'              char
    [1]         0xcc 'Ì'              char
    [2]         0x88 'ˆ'              char
When I write it to a UTF-8 file, the bytes stay the same (0x55 0xCC 0x88), and Notepad++ shows the proper character Ü.
However, when I add another Ü to the file via Notepad++ and save it again, the newly written Ü shows up as the bytes 0xC3 0x9C (which is what I expected in the first place).
I do not understand why I get a 3-byte representation of this character instead of the expected two-byte UTF-8 encoding of the Unicode code point U+00DC.
Although Notepad++ displays both forms correctly, our proprietary system renders 0xC3 0x9C as Ü and breaks on 0x55 0xCC 0x88 by rendering Ü, because it does not recognize those bytes as the two-byte UTF-8 character.
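
To make the two byte sequences concrete, here is a small portable sketch that encodes code points to UTF-8 by hand. The helper name `encode_utf8` is hypothetical; it is not part of the linked answer or of the Windows API, and it skips validation. It suggests that 0x55 0xCC 0x88 is not a single 3-byte character but U+0055 ('U') followed by U+0308 (COMBINING DIAERESIS), while 0xC3 0x9C is U+00DC on its own. That would be consistent with the `MB_COMPOSITE` flag, which is documented as requesting decomposed (base plus combining) characters, though I have not verified this on the proprietary system.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper for illustration only: encode one Unicode code point
// as UTF-8. For brevity it does not reject surrogates or values > U+10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this helper, `encode_utf8(0x00DC)` yields the bytes C3 9C, while `encode_utf8(0x0055) + encode_utf8(0x0308)` yields 55 CC 88, matching the two sequences seen in the file.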