At our company, we are planning to make our application Unicode-aware, and we are analyzing what problems we are going to encounter.
In particular, our application relies heavily on string lengths, and we would like to use wchar_t as the base character type.
The problem arises with characters that need two 16-bit units in UTF-16, namely characters at U+10000 and above.
Simple example:
I have the UTF-8 string "蟂" (Unicode character U+87C2, in UTF-8: E8 9F 82)
So I wrote the following code:
const unsigned char my_utf8_string[] = { 0xe8, 0x9f, 0x82, 0x00 };
// compute size of wchar_t buffer.
int nb_chars = ::MultiByteToWideChar(CP_UTF8, // input is UTF8
0, // no flags
reinterpret_cast<const char *>(my_utf8_string), // input string (no worries about signedness)
-1, // input is zero-terminated
NULL, // no output this time
0); // need the necessary buffer size
// allocate (the returned size is in wchar_t units and includes the terminating zero)
wchar_t *my_utf16_string = new wchar_t[nb_chars];
// convert
nb_chars = ::MultiByteToWideChar(CP_UTF8,
0,
reinterpret_cast<const char *>(my_utf8_string),
-1,
my_utf16_string, // output buffer
nb_chars); // allocated size
Okay, this works: it allocates two 16-bit units, and my buffer of wchar_t contains { 0x87c2, 0x0000 }. If I store it in a std::wstring and compute the size, I get 1.
Now, let us take the character U+104A2 as input, in UTF-8: F0 90 92 A2.
This time, it allocates space for three wchar_t, and std::wstring::size returns 2, even though I consider that I have only one character.
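To make the mismatch concrete, here is a minimal sketch of how the number of characters (code points) could be counted in a UTF-16 string, as opposed to the number of 16-bit units that size() reports. The helper name count_code_points_utf16 is hypothetical, and the sketch uses std::u16string rather than std::wstring so that it behaves the same on platforms where wchar_t is not 16 bits. It assumes well-formed UTF-16 and only checks that assumption with assert in debug builds.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-16 string.
// Assumes well-formed input, i.e. every high surrogate (0xD800..0xDBFF)
// is immediately followed by a low surrogate (0xDC00..0xDFFF).
std::size_t count_code_points_utf16(const std::u16string& s) {
    std::size_t count = 0;
    bool pending_high = false;  // true right after a high surrogate
    for (char16_t unit : s) {
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        const bool is_low = unit >= 0xDC00 && unit <= 0xDFFF;
        // Debug-only sanity check: a low surrogate must appear exactly
        // when the previous unit was a high surrogate.
        assert(is_low == pending_high && "unpaired surrogate");
        // A low surrogate is the second half of a pair, so it does not
        // start a new code point; every other unit does.
        if (!is_low) ++count;
        pending_high = is_high;
    }
    assert(!pending_high && "string ends with a high surrogate");
    return count;
}
```

With this sketch, U+87C2 gives 1 for both size() and the helper, while U+104A2 (stored as the pair D801 DCA2) gives size() == 2 but a code point count of 1.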
This is problematic. Let us assume that we receive data in UTF-8. We can count Unicode characters simply by not counting bytes that match 10xxxxxx. We would like to import that data into an array of wchar_t to work with it. If we just allocate the number of characters plus one, it might be safe... until someone uses a character above U+FFFF. Then our buffer will be too short and our application will crash.
So, with the same string, encoded in different ways, functions that count characters in a string will return different values?
How are applications that work with Unicode strings designed to avoid this sort of annoyance?
Thank you for your replies.