Please clarify for me: how does UTF-16 work? I am a little confused, considering these points:
- There is a character type on Windows, WCHAR, which is 2 bytes long. (Always 2 bytes long, obviously.) (UPDATE: as shown by the answers, this assumption was wrong; see the sketch after this list.)
- Most of MSDN and some other documentation seem to assume that characters are always 2 bytes long. This may just be my imagination; I can't come up with any particular examples, but it just seems that way.
- There are no "extra wide" functions or characters types widely used in C++ or windows, so I would assume that UTF16 is all that is ever needed.
- To my uncertain knowledge, Unicode has a lot more characters than 65535, so they obviously don't all fit in 2 bytes.
- UTF-16 seems to be a bigger version of UTF-8, and UTF-8 characters can be of different lengths.
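To make the confusion concrete, here is a small test I put together (a minimal sketch, assuming the Microsoft toolchain where wchar_t is 2 bytes, and using U+10437 as an example of a character above U+FFFF):

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    // On Windows, WCHAR is a typedef for wchar_t, which is 2 bytes.
    static_assert(sizeof(WCHAR) == 2, "WCHAR is 2 bytes on Windows");

    // U+0041 'A' fits in one WCHAR...
    const WCHAR a[] = L"A";
    // ...but U+10437 is above U+FFFF, so the wide literal encodes it
    // as TWO WCHARs (a surrogate pair).
    const WCHAR yee[] = L"\U00010437";

    std::printf("L\"A\":          %zu bytes\n", sizeof(a));   // 4: one WCHAR + terminator
    std::printf("L\"\\U00010437\": %zu bytes\n", sizeof(yee)); // 6: two WCHARs + terminator
}
```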
So if a UTF-16 character is not always 2 bytes long, how long else could it be? 3 bytes? Or only multiples of 2? And then, for example, if there is a WinAPI function that wants to know the size of a wide string in characters, and the string contains 2 characters which are each 4 bytes long, how is the size of that string in characters calculated?
Is it 2 characters long or 4 characters long? (Since it is 8 bytes long, and each WCHAR is 2 bytes.)
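Here is the kind of test I mean (a sketch under the same assumptions as above; wcslen is the CRT function, and I'm not sure whether its count matches what the WinAPI means by "characters"):

```cpp
#include <cstdio>
#include <cwchar>

int main() {
    // Two code points, each encoded as two wchar_t units (surrogate pairs):
    // 4 code units, 8 bytes of character data.
    const wchar_t s[] = L"\U00010437\U00010437";

    std::printf("bytes (excluding terminator): %zu\n",
                sizeof(s) - sizeof(wchar_t));        // prints 8 (on Windows)
    std::printf("wcslen(s): %zu\n", std::wcslen(s)); // prints 4
}
```

If I understand wcslen correctly, it reports 4 here, i.e. it counts wchar_t units rather than code points, but I don't know whether the W functions agree.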
UPDATE: Now I see that character counting is not necessarily a standard thing, or even a C++ thing, so I'll try to be a little more specific in my second question, about the length in "characters" of a wide string:
On Windows specifically, in the WinAPI, in the wide functions (ending with W), how does one count the number of characters in a string that consists of 2 Unicode code points, each consisting of 2 code units (8 bytes total)? Is such a string 2 characters long (the same as the number of code points) or 4 characters long (the same as the total number of code units)?
Or, more generally: what does the Windows definition of "number of characters in a wide string" mean: the number of code points or the number of code units?
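For example, with a real W function (a minimal sketch; lstrlenW is the WinAPI string-length function, and I'm assuming its count is representative of how the W functions count "characters"):

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    // Two Unicode code points, four UTF-16 code units, eight bytes of data.
    const WCHAR s[] = L"\U00010437\U00010437";

    // Does a W function report 2 ("characters" = code points)
    // or 4 ("characters" = code units)?
    std::printf("lstrlenW(s) = %d\n", lstrlenW(s));
}
```

If I'm reading things right, this prints 4 (code units), but I'd like confirmation that this is what the documentation means by "characters" everywhere.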