7

Please clarify for me, how does UTF16 work? I am a little confused, considering these points:

  • There is a static type in C++, WCHAR, which is 2 bytes long. (always 2 bytes long, obviously) (UPDATE: as shown by the answers, this assumption was wrong).
  • Most of MSDN and some other documentation seems to assume that the characters are always 2 bytes long. This may just be my imagination; I can't come up with any particular examples, but it just seems that way.
  • There are no "extra wide" functions or character types widely used in C++ or Windows, so I would assume that UTF16 is all that is ever needed.
  • To my uncertain knowledge, Unicode has a lot more characters than 65535, so they obviously don't have enough space in 2 bytes.
  • UTF16 seems to be a bigger version of UTF8, and UTF8 characters can be of different lengths.

So if a UTF16 character is not always 2 bytes long, how long else could it be? 3 bytes? Or only multiples of 2? And then, for example, if there is a winapi function that wants to know the size of a wide string in characters, and the string contains 2 characters which are each 4 bytes long, how is the size of that string in characters calculated?

Is it 2 chars long or 4 chars long? (since it is 8 bytes long, and each WCHAR is 2 bytes)

UPDATE: Now I see that character-counting is not necessarily a standard thing or even a C++ thing, so I'll try to be a little more specific in my second question, about the length in "characters" of a wide string:

On Windows, specifically, in Winapi, in their wide functions (ending with W), how does one count the number of characters in a string that consists of 2 Unicode codepoints, each consisting of 2 codeunits (total of 8 bytes)? Is such a string 2 characters long (the same as the number of codepoints) or 4 characters long (the same as the total number of codeunits)?

Or, being more generic: what does the Windows definition of "number of characters in a wide string" mean, the number of codepoints or the number of codeunits?

Cray
  • 2,396
  • 19
  • 29
  • 1
    A UTF-16 code unit is always two bytes. A Unicode character may take up 1 or 2 code units. – President James K. Polk Jan 10 '11 at 23:07
  • 2
    Yeah. IMO UTF-16 is the worst of both worlds: always takes up more space than 8-bit encodings (like UCS-4), and does not have a constant codepoint size (like UTF-8). Granted, the latter is not that important, since combining codepoints mean that a logical character can always have a variable-size representation, but the former is worse than UTF-8. – ephemient Jan 10 '11 at 23:14
  • OK, so to answer my second question, a string of 2 UTF16 characters, where each character is 4 bytes long, is considered being 2-character long by all winapiW functions and such? – Cray Jan 10 '11 at 23:18
  • 1
    You can read this nice article: http://www.joelonsoftware.com/articles/Unicode.html – ruslik Jan 11 '11 at 00:07
  • ephemient, in light of this it seems especially strange that winapi uses UTF16 as their wide format, not UTF8 (which would take a lot less space 99% of the time), or a "full-unicode" format (in which characters would ALWAYS be of fixed size). – Cray Jan 11 '11 at 00:39
  • 1
    @Cray: UTF-16 (actually, UCS-2) *was* a convenient fixed-width encoding at the time that Windows NT was being developed. The expansion of Unicode beyond the BMP spoiled that. – dan04 Jan 11 '11 at 01:03
  • You still have this misconception. Characters aren't fixed size in any encoding scheme, also not in UTF-32. And because of that, having each codepoint represented by the same number of codeunits isn't that helpful. Also, text isn't really that big, unless you do very specialized applications, the size won't be an issue. So, the choice of UTF-8, UTF-16 or UTF-32 as in-memory storage has very little effect on most applications. – etarion Jan 11 '11 at 01:37
  • @etarion: don't confuse characters with glyphs. – Seva Alekseyev Jan 29 '11 at 02:00
  • @Seva: In the above text, "character" is "character" as defined by unicode. – etarion Jan 29 '11 at 09:28

8 Answers

8

Short answer: No.

The size of a wchar_t—the basic character unit—is not defined by the C++ Standard (see section 3.9.1 paragraph 5). In practice, on Windows platforms it is two bytes long, and on Linux/Mac platforms it is four bytes long.

In addition, the characters are stored in an endian-specific format. On Windows this usually means little-endian, but it’s also valid for a wchar_t to contain big-endian data.

Furthermore, even though each wchar_t is two (or four) bytes long, an individual glyph (roughly, a character) could require multiple wchar_ts, and there may be more than one way to represent it.

A common example is the character é (LATIN SMALL LETTER E WITH ACUTE), code point 0x00E9. This can also be represented as the “decomposed” code point sequence 0x0065 0x0301 (which is LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT). Both are valid; see the Wikipedia article on Unicode equivalence for more information.
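
To make the difference concrete, here is a minimal sketch (assuming a Windows-style toolchain where wchar_t is a 16-bit UTF-16 code unit) showing that the two forms are distinct code-unit sequences even though they display identically:

    #include <cwchar>
    #include <cstdio>

    int main() {
        // Precomposed form: the single code point U+00E9 (one UTF-16 code unit).
        const wchar_t precomposed[] = L"\u00E9";
        // Decomposed form: 'e' (U+0065) followed by U+0301 (combining acute accent).
        const wchar_t decomposed[] = L"e\u0301";

        // Both render as the same glyph, but the code-unit counts differ.
        std::wprintf(L"precomposed: %zu unit(s)\n", std::wcslen(precomposed)); // 1
        std::wprintf(L"decomposed:  %zu unit(s)\n", std::wcslen(decomposed));  // 2
    }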

Simply, you need to know or pick the encoding that you will be using. If dealing with Windows APIs, an easy choice is to assume everything is little-endian UTF-16 stored in 2-byte wchar_ts.

On Linux/Mac, UTF-8 (with chars) is more common and APIs usually take UTF-8; wchar_t is often seen as wasteful there because it uses 4 bytes per character.

For cross-platform programming, therefore, you may wish to work with UTF-8 internally and convert to UTF-16 on-the-fly when calling Windows APIs. Windows provides the MultiByteToWideChar and WideCharToMultiByte functions to do this, and you can also find wrappers that simplify using these functions, such as the ATL and MFC String Conversion Macros.
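
For example, conversion helpers along these lines are common (a sketch only; the helper names are mine and error handling is omitted):

    #include <windows.h>
    #include <string>

    // Sketch: UTF-8 <-> UTF-16 conversion via the Win32 API.
    std::wstring Utf8ToUtf16(const std::string& utf8)
    {
        if (utf8.empty()) return std::wstring();
        // First call asks how many wchar_t units are needed.
        int units = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
        std::wstring utf16(units, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &utf16[0], units);
        return utf16;
    }

    std::string Utf16ToUtf8(const std::wstring& utf16)
    {
        if (utf16.empty()) return std::string();
        int bytes = WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(),
                                        nullptr, 0, nullptr, nullptr);
        std::string utf8(bytes, '\0');
        WideCharToMultiByte(CP_UTF8, 0, utf16.data(), (int)utf16.size(),
                            &utf8[0], bytes, nullptr, nullptr);
        return utf8;
    }

Note that the counts these functions deal in are wchar_t units on the wide side and bytes on the UTF-8 side, which is exactly the "number of characters" question addressed in the update below.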

Update

The question has been updated to ask what Windows APIs mean when they ask for the “number of characters” in a string.

If the API says “size of the string in characters” they are referring to the number of wchar_ts (or the number of chars if you are compiling in non-Unicode mode for some reason). In that specific case you can ignore the fact that a Unicode character may take more than one wchar_t. Those APIs are just looking to fill a buffer and need to know how much room they have.
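
A small illustration (assuming a 16-bit wchar_t, as on Windows): a string holding a single code point outside the BMP still counts as two "characters" in this API sense:

    #include <cwchar>
    #include <cstdio>

    int main() {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so in UTF-16
        // it is stored as the surrogate pair 0xD834 0xDD1E.
        const wchar_t clef[] = { 0xD834, 0xDD1E, 0 };

        // wcslen -- like the "size in characters" parameters of the W APIs --
        // counts wchar_t units, not Unicode code points.
        std::wprintf(L"length: %zu\n", std::wcslen(clef)); // prints 2 for one code point
    }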

Nate
  • 18,752
  • 8
  • 48
  • 54
5

You seem to have several misconceptions.

There is a static type in C++, WCHAR, which is 2 bytes long. (always 2 bytes long, obviously)

This is wrong. Assuming you mean the C++ type wchar_t, it is not always 2 bytes long; 4 bytes is also a common value, and there's no restriction that it can be only those two values. If you don't mean that, it isn't part of C++ but is some platform-specific type.

  • There are no "extra wide" functions or character types widely used in C++ or Windows, so I would assume that UTF16 is all that is ever needed.

  • UTF16 seems to be a bigger version of UTF8, and UTF8 characters can be of different lengths.

UTF-8 and UTF-16 are different encodings for the same character set, so UTF-16 is not "bigger". Technically, the scheme used in UTF-8 could encode more characters than the scheme used in UTF-16, but as UTF-8 and UTF-16 they encode the same set.

Don't use the term "character" lightly when it comes to Unicode. A codeunit in UTF-16 is 2 bytes wide, and a codepoint is represented by 1 or 2 codeunits. What humans usually understand as "characters" is different again and can be composed of one or more codepoints; if you as a programmer confuse codepoints with characters, bad things can happen, like http://ideone.com/qV2il
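
A rough sketch of the distinction (assuming a 16-bit wchar_t holding UTF-16, as on Windows): the same string gives different answers depending on whether you count codeunits or codepoints, and the user-perceived character count can be smaller still:

    #include <cwchar>
    #include <cstdio>

    int main() {
        // "e" + combining acute accent (2 codepoints, 2 codeunits, 1 perceived character)
        // followed by U+1D11E as a surrogate pair (1 codepoint, 2 codeunits).
        const wchar_t text[] = { L'e', 0x0301, 0xD834, 0xDD1E, 0 };

        size_t units = std::wcslen(text);        // 4 codeunits
        size_t codepoints = 0;
        for (size_t i = 0; text[i] != 0; ++i) {
            // A low (trailing) surrogate continues the previous codepoint.
            if (text[i] >= 0xDC00 && text[i] <= 0xDFFF) continue;
            ++codepoints;                        // ends up as 3 codepoints
        }

        std::wprintf(L"codeunits: %zu, codepoints: %zu\n", units, codepoints);
        // A human would likely call this 2 "characters": an accented e and a clef.
    }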

etarion
  • 16,935
  • 4
  • 43
  • 66
  • I am specifically referring to the term "character" as it is used for example in Winapi functions. They all write things like "size of the string in characters". But thanks for clarifying the codepoint & codeunit. – Cray Jan 11 '11 at 00:21
  • 1
    the original UTF-8 spec, defined in RFCs 2044 and 2279, supported a max of 6 codeunits (U+0000 to U+7FFFFFFF). However, for interoperability with UTF-16, RFC 3629 (now adopted by Unicode itself) limits UTF-8 to a max of 4 codeunits so that it covers the same range of codepoints that UTF-16 supports (U+0000 to U+10FFFF). This makes the two more compatible for lossless conversions. – Remy Lebeau Jan 11 '11 at 01:22
  • @TeamB: Why do you tell me something I know? – etarion Jan 11 '11 at 01:42
4

Windows' WCHAR is 16 bits (2 bytes) long.

A Unicode codepoint may be represented by one or two of these WCHARs – 16 or 32 bits (2 or 4 bytes).

wcslen returns the number of WCHAR units in a wide string, not the number of codepoints. Since a codepoint may occupy two units (a surrogate pair), the codepoint count is always less than or equal to the unit count.

A Unicode character may consist of multiple combining codepoints.

ephemient
  • 198,619
  • 38
  • 280
  • 391
  • Thanks! But don't you mean codeunits (not codepoints) (as explained in etarion's answer) in your very last sentence? Otherwise your explanation means that a unicode character can be represented by more than 4 bytes, is that how you have meant it? – Cray Jan 11 '11 at 00:33
  • 1
    @Cray: Yes, that is how I meant it. A Unicode grapheme (what you would consider a character) can be made up of multiple combining codepoints; a Unicode codepoint can be made up of one or two UTF-16 units. – ephemient Jan 11 '11 at 00:49
2

Short story: UTF-16 is a variable-length encoding. A single character may be one or two widechars long.

HOWEVER, you may very well get away with treating it as a fixed-length encoding where every character is one widechar (2 bytes). This is formally called UCS-2, and it used to be Win32's assumption until Windows NT 4. The UCS-2 charset includes practically all living, dead and constructed human languages. And truth be told, working with variable-length encoding strings just sucks. Iteration becomes an O(n) operation, string length is not the same as string size, etc. Any sensible parsing becomes a pain.

As for the UTF-16 chars that are not in UCS-2... I only know two subsets that may theoretically come up in real life. First is emoji - the graphical smileys that are popular in the Japanese cell phone culture. On iPhone, there's a bunch of third-party apps that enable input of those. Except on mobile phones, they don't display properly. The other character class is VERY obscure Chinese characters. The ones even most Chinese don't know. All the popular Chinese characters are well inside UCS-2.
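
If you want to know whether a given string actually fits the UCS-2 assumption, one way (a sketch; the helper name is mine, and it assumes a 16-bit wchar_t) is to scan it for surrogate code units:

    #include <cwchar>

    // Returns true if the wide string contains no surrogate code units,
    // i.e. treating each wchar_t as one character is safe.
    bool IsPureUcs2(const wchar_t* s)
    {
        for (; *s != 0; ++s) {
            if (*s >= 0xD800 && *s <= 0xDFFF) {
                return false; // part of a surrogate pair (or a stray surrogate)
            }
        }
        return true;
    }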

Seva Alekseyev
  • 59,826
  • 25
  • 160
  • 281
  • One example of a character that is more than one codepoint long under some circumstances is ä, since it is sometimes composed from a and a combining ¨ – plaisthos Jan 10 '11 at 23:37
  • Does this mean that all UCS-2 characters are also valid UTF16 characters? (that UTF16 is a superset of UCS-2) ? – Cray Jan 11 '11 at 00:25
  • 1
    Yes. UCS-2 only supported the BMP (codepoints U+0000 to U+FFFF). UTF-16 supports the BMP using the same encoding scheme, and then uses surrogate pairs for higher codepoints. – Remy Lebeau Jan 11 '11 at 01:26
  • 1
    More precisely, UTF-16 supports the BMP *except* for the range U+D800 to U+DFFF, which is reserved for the specific purpose of surrogate pairs. – dan04 Jan 11 '11 at 02:56
2

There is a static type in C++, WCHAR, which is 2 bytes long. (always 2 bytes long, obviously)

Well, WCHAR is an MS thing, not a C++ thing.
But there is wchar_t for wide characters, and it is not always 2 bytes; on Linux systems it is usually 4 bytes.

Most of MSDN and some other documentation seems to assume that the characters are always 2 bytes long. This may just be my imagination; I can't come up with any particular examples, but it just seems that way.

Do they? I can believe it.

There are no "extra wide" functions or characters types widely used in C++ or windows, so I would assume that UTF16 is all that is ever needed.

C/C++ makes no assumptions about character encoding, though the OS can. For example, Windows uses UTF-16 at its interfaces, while many Linux interfaces use UTF-32. But you need to read the documentation for each interface to know explicitly.

To my uncertain knowledge, Unicode has a lot more characters than 65535, so they obviously don't have enough space in 2 bytes.

2 bytes is all you need for the numbers 0 -> 65535.

But the Unicode code space (which the UTF encodings represent) extends beyond that, up to U+10FFFF. Thus some code points are encoded as two 16-bit code units in UTF-16 (these are referred to as surrogate pairs).
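
The mapping from a supplementary code point to its surrogate pair is simple arithmetic; a sketch (the function name is mine):

    #include <cstdio>

    // Split a code point above U+FFFF into its UTF-16 surrogate pair.
    void ToSurrogatePair(unsigned cp, unsigned short& high, unsigned short& low)
    {
        cp -= 0x10000;                                    // leaves a 20-bit value
        high = (unsigned short)(0xD800 + (cp >> 10));     // top 10 bits
        low  = (unsigned short)(0xDC00 + (cp & 0x3FF));   // bottom 10 bits
    }

    int main() {
        unsigned short high, low;
        ToSurrogatePair(0x1D11E, high, low);              // MUSICAL SYMBOL G CLEF
        std::printf("%04X %04X\n", high, low);            // prints D834 DD1E
    }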

UTF16 seems to be a bigger version of UTF8, and UTF8 characters can be of different lengths.

UTF-8, UTF-16 and UTF-32 all encode the same set of code points. UTF-32 is the only one with a fixed number of code units per code point (UTF-16 was supposed to be fixed size, but then lots of other characters turned up that needed encoding (like Klingon) and we ran out of space in plane 0, so 16 more planes were added, hence the four extra bits).

So if a UTF16 character is not always 2 bytes long, how long else could it be? 3 bytes? Or only multiples of 2?

It is either one 16-bit code unit or two 16-bit code units (i.e. 2 or 4 bytes).

And then for example if there is a winapi function that wants to know the size of a wide string in characters, and the string contains 2 characters which are each 4 bytes long, how is the size of that string in characters calculated?

You have to step along and calculate each character one at a time.

Is it 2 chars long or 4 chars long? (since it is 8 bytes long, and each WCHAR is 2 bytes)

It all depends on your system.

Martin York
  • 257,169
  • 86
  • 333
  • 562
  • You misunderstood a couple of small points in my question, but all in all a great explanation, thanks! I am still unsure though how the length can be a system-dependent thing... – Cray Jan 11 '11 at 00:30
1

This Wikipedia article seems to be a good intro.

UTF-16 (16-bit Unicode Transformation Format) is a character encoding for Unicode capable of encoding 1,112,064 numbers (called code points) in the Unicode code space from 0 to 0x10FFFF. It produces a variable-length result of either one or two 16-bit code units per code point.
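
(As a quick check of that number: 17 planes × 65,536 code points per plane = 1,114,112, minus the 2,048 surrogate code points that cannot stand alone, leaves 1,112,064.)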

Mike Sherrill 'Cat Recall'
  • 91,602
  • 17
  • 122
  • 185
1

According to the Unicode FAQ it could be

one or two 16-bit code units

Windows uses 16-bit chars, probably because Unicode was originally 16-bit. So you don't have an exact map, but you might be able to get away with treating all strings you see as just containing 16-bit Unicode characters.

mmmmmm
  • 32,227
  • 27
  • 88
  • 117
1

All characters in the Basic Multilingual Plane will be 2 bytes long.

Characters in other planes will be encoded into 4 bytes each, in the form of a surrogate pair.

Obviously, if a function does not try to detect surrogate pairs and blindly treats each pair of bytes as a character, it will bug out on strings that contain such pairs.

Jon
  • 428,835
  • 81
  • 738
  • 806
  • Indeed so, but how likely is it that you encounter a character from a higher plane? – Seva Alekseyev Jan 10 '11 at 23:32
  • @Seva: depends on where you live and who you do business with. Or, more importantly, who your customers do business with. – Jon Jan 10 '11 at 23:34
  • Also what do you do to the strings. If just store and display, then it's almost never an issue, unless you try to determine visible string length from the character count (which is a bad idea for other reasons, too). For parsing purposes, however... Depending on the nature of parsing, it could become quite a can of worms. – Seva Alekseyev Jan 11 '11 at 17:33