30

Examining the attributes of UTF-16 and UTF-8, I can't find any reason to prefer UTF-16.

However, checking out Java and C#, it looks like strings and chars there default to UTF-16. I was thinking it might be for historical reasons, or perhaps for performance, but couldn't find any information.

Does anyone know why these languages chose UTF-16? And is there any valid reason for me to do the same?

EDIT: Meanwhile I've also found this answer, which seems relevant and has some interesting links.

Oak
  • In addition to my answer, I would say that .NET/C# chose UTF-16 because that's the "native" encoding of Windows: it's easier to interop with native Windows if you're using the same encoding. – Dean Harding May 29 '10 at 11:49
  • For what purposes are you choosing an encoding? UTF-16 is a reasonable choice for in-memory string handling, as is `wchar_t` which will be UTF-16 on Windows and typically UTF-32 elsewhere. But for on-the-wire protocols and file storage, UTF-8 is almost always the best choice. – bobince May 29 '10 at 12:29
  • @codeka: I agree (gave you +1), but then one could also ask the question "why is the native encoding of Windows UTF-16 and not UTF-8?". – Andreas Rejbrand May 29 '10 at 15:48
  • The Qt C++ framework also uses UTF-16 for strings – Roman A. Taycher Feb 27 '11 at 13:33
  • Prefer UTF-16 if it is native to your Operating System or Programming Language. That means Windows, C#, and Java mainly. Choose UTF-8 if it is native to your Operating System or Programming Language, or when your programming language doesn't really have a native encoding. This means *nix and Mac OS X, C, C++. If you're cross-platform from the start it seems easier to get UTF-8 to work nicely on Windows than to use UTF-16 everywhere on *nix in the case of C/C++. Perl is designed to work with all encodings but implicit conversions lead to many errors. JavaScript including node.js uses UCS-2!! – hippietrail May 30 '13 at 01:50

7 Answers

36

East Asian languages typically require less storage in UTF-16 (2 bytes is enough for 99% of East Asian characters) than in UTF-8 (which typically requires 3 bytes).

Of course, for Western languages, UTF-8 is usually smaller (1 byte instead of 2). For mixed files like HTML (where there's a lot of markup), it's much of a muchness.

Processing UTF-16 in user-mode applications is slightly easier than processing UTF-8, because surrogate pairs behave in almost the same way as combining characters. So UTF-16 can usually be processed as a fixed-size encoding.
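If you want to see the size difference concretely, here is a quick Java sketch (the sample strings are just arbitrary examples picked for illustration; UTF_16LE is used so the count doesn't include a BOM):

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        String ascii = "Hello";
        String japanese = "\u3053\u3093\u306B\u3061\u306F"; // "konnichiwa", 5 characters

        // ASCII text: 1 byte per character in UTF-8, 2 in UTF-16.
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);       // 5
        System.out.println(ascii.getBytes(StandardCharsets.UTF_16LE).length);    // 10

        // Japanese text: 3 bytes per character in UTF-8, 2 in UTF-16.
        System.out.println(japanese.getBytes(StandardCharsets.UTF_8).length);    // 15
        System.out.println(japanese.getBytes(StandardCharsets.UTF_16LE).length); // 10
    }
}
```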

Dean Harding
  • +1 For correctly characterizing the number of bytes per character in UTF-16 and UTF-8. – Joren May 29 '10 at 11:44
  • I thought UTF-8 can encode up to 4 bytes which pretty much makes UTF-16 and UTF-32 useless. – Razor May 29 '10 at 12:50
  • @Sir Psycho: UTF-8 is a variable-length encoding, which is more complex to process than a fixed-length encoding. Also, see my comments on Gumbo's answer: basically, combining characters exist in all encodings (UTF-8, UTF-16 & UTF-32) and they require special handling. You can use the same special handling that you use for combining characters to also handle surrogate pairs in UTF-16, so *for the most part* you can ignore surrogates and treat UTF-16 just like a fixed encoding. – Dean Harding May 29 '10 at 12:58
  • @Sir Psycho: UTF-8, UTF-16, and UTF-32 are all able to encode all the characters of Unicode. codeka was talking about how many bytes result from encoding a "typical" Unicode character using UTF-8 and UTF-16. – President James K. Polk May 29 '10 at 12:58
  • The key word there is "can **usually** be processed as a fixed-size encoding". It is still absolutely incorrect to do so, if you care about the integrity of characters. What you're actually doing is meaning to write code to manipulate "characters", but actually writing it to manipulate "16-bit chunks of data". If you mean to manipulate characters (swap them, uppercase them, reverse them, etc.), then you need to observe all the rules of the character encoding, not just the ones that are convenient. Software BLOWS UP, because people make stupid assumptions :( – Triynko Apr 13 '11 at 22:47
  • If you're swapping, reversing or upper-casing characters, then even in UTF-32, you need to consider combining characters. The point is, handling surrogate pairs is, for the most part, the same as handling combining characters. So if you're already handling combining characters correctly, there is almost nothing extra required to handle surrogate pairs. – Dean Harding Apr 14 '11 at 14:00
10

@Oak: this is too long for a comment...

I don't know about C# (and would be really surprised: it would mean they just copied Java too much) but for Java it's simple: Java was conceived before Unicode 3.1 came out.

Hence there were fewer than 65,537 code points, so every Unicode code point still fit in 16 bits, and the Java char was born.

Of course this led to crazy issues that are still affecting Java programmers (like me) today, where you have a method charAt which in some cases returns neither a Unicode character nor a Unicode code point, and a method (added in Java 5) codePointAt whose argument is not the number of code points you want to skip! (You have to supply to codePointAt the number of Java chars you want to skip, which makes it one of the least understood methods in the String class.)
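To make that concrete, here is a small sketch (U+1D11E, a musical symbol, is just an arbitrary supplementary character chosen for the example):

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // U+1D11E is outside the BMP, so Java stores it as a surrogate pair (two chars).
        String s = new String(Character.toChars(0x1D11E)) + "A";

        System.out.println(s.length());                        // 3 -- char units, not code points
        System.out.println(s.codePointCount(0, s.length()));   // 2 -- actual code points

        // charAt(0) returns only the high surrogate, not a character:
        System.out.println(Integer.toHexString(s.charAt(0)));       // d834
        // codePointAt takes a *char* index, not a code point index:
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1d11e
        System.out.println(Integer.toHexString(s.codePointAt(1)));  // dd1e -- a lone low surrogate
        System.out.println((char) s.codePointAt(2));                // A
    }
}
```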

So, yup, this is definitely wild and confusing to most Java programmers (most aren't even aware of these issues) and, yup, it is for historical reasons. At least, that was the excuse that came up when people got mad about this issue: it's because Unicode 3.1 wasn't out yet.

:)

NoozNooz42
8

I imagine C# using UTF-16 derives from the Windows NT family of operating systems using UTF-16 internally.

I imagine there are two main reasons why Windows NT uses UTF-16 internally:

  • For memory usage: UTF-32 wastes a lot of space to encode.
  • For performance: UTF-8 is much harder to decode than UTF-16. In UTF-16, characters are either a Basic Multilingual Plane character (2 bytes) or a surrogate pair (4 bytes). UTF-8 characters can be anywhere between 1 and 4 bytes.

Contrary to what other people have answered - you cannot treat UTF-16 as UCS-2. If you want to correctly iterate over actual characters in a string, you have to use Unicode-friendly iteration functions. For example, in C# you need to use StringInfo.GetTextElementEnumerator().
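A rough Java analogue of the same idea uses java.text.BreakIterator (this is an approximation on my part; C# text elements and Java's grapheme boundaries are not defined identically, but the intent is the same):

```java
import java.text.BreakIterator;

public class TextElements {
    public static void main(String[] args) {
        // "e" + combining acute accent, then a supplementary code point (U+1F600).
        String s = "e\u0301" + new String(Character.toChars(0x1F600));

        System.out.println(s.length()); // 4 char units, but only 2 user-perceived characters

        // Iterate user-perceived characters instead of raw char units.
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            System.out.println("element: " + s.substring(start, end));
        }
    }
}
```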

For further information, this Wikipedia page is worth reading: http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

Andrew Russell
  • Oh, and don't forget combining characters! (Which `GetTextElementEnumerator` will also handle.) – Andrew Russell May 29 '10 at 12:08
  • "...you cannot treat UTF-16 as UCS-2" - but many successful real-world applications do, and get away with it because they are only using BMP characters. – Joe May 29 '10 at 12:50
  • @Joe For simply pushing text about, just pretending each character is 2 bytes will "work" (just like you can often pretend UTF-8 is ASCII and hope for the best). In fact, that's what you're usually doing when you use `string` in C#. But what happens if I paste or load some text into your application in, say, a decomposed format? Anything that does processing on a character-by-character basis needs to do so with an actual understanding of how that text is encoded. (Fortunately most applications work on strings, not characters.) – Andrew Russell May 30 '10 at 09:38
  • The bigger reason is that the original Windows NT was released about the same time as Unicode 1.1, before there were supplementary planes. – dan04 Jul 06 '10 at 13:24
  • Insightful thinking here, +1, and now let's wait for UTF64 ;) – Peter Dec 22 '17 at 11:35
3

It depends on the expected character sets. If you expect heavy use of Unicode code points outside of the 7-bit ASCII range then you might find that UTF-16 will be more compact than UTF-8, since some UTF-8 sequences are more than two bytes long.

Also, for efficiency reasons, Java and C# do not take surrogate pairs into account when indexing strings. This would break down completely when using code points that are represented with UTF-8 sequences that take up an odd number of bytes.
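For example, in Java (the supplementary character below is just an arbitrary example), char-based indexing and slicing can split a surrogate pair, while the code-point APIs keep it intact:

```java
public class SurrogateIndexing {
    public static void main(String[] args) {
        // One supplementary character (U+1F600) between two ASCII letters.
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";

        System.out.println(s.length());                       // 4 char units
        System.out.println(s.codePointCount(0, s.length()));  // 3 code points

        // Naive char-based slicing can cut the surrogate pair in half:
        String broken = s.substring(0, 2);
        System.out.println(Character.isHighSurrogate(broken.charAt(1))); // true -- malformed tail

        // Code-point-aware navigation keeps the pair together:
        int end = s.offsetByCodePoints(0, 2); // char index just past the 2nd code point
        System.out.println(s.substring(0, end).length());     // 3 chars: "a" + the full pair
    }
}
```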

corvuscorax
  • Could you elaborate on "Java and C# do not take surrogate pairs into account when indexing strings"? – Oak May 29 '10 at 12:25
  • If you have a string in C# (or Java) that contains surrogate pairs (SPs are used to encode characters outside the normal two-byte range), each pair will count as two 16-bit characters, rather than as 1 Unicode code point. At least for indexing and length reporting purposes. – corvuscorax May 29 '10 at 12:46
3

UTF-16 can be more efficient for representing characters in some languages such as Chinese, Japanese and Korean, where most characters can be represented in one 16-bit word. Some rarely used characters may require two 16-bit words. UTF-8 is generally much more efficient for representing characters from Western European character sets - UTF-8 and ASCII are equivalent over the ASCII range (0-127) - but less efficient with Asian languages, requiring three or four bytes to represent characters that can be represented with two bytes in UTF-16.

UTF-16 has an advantage as an in-memory format for Java/C# in that every character in the Basic Multilingual Plane can be represented in 16 bits (see Joe's answer) and some of the disadvantages of UTF-16 (e.g. confusing code relying on \0 terminators) are less relevant.

richj
3

If we're talking about plain text alone, UTF-16 can be more compact in some languages, Japanese (about 20%) and Chinese (about 40%) being prime examples. As soon as you're comparing HTML documents, the advantage goes completely the other way, since UTF-16 is going to waste a byte for every ASCII character.

As for simplicity or efficiency: if you implement Unicode correctly in an editor application, complexity will be similar, because UTF-16 does not always encode code points as a single code unit anyway, and single code points are generally not the right way to segment text.

Given that in the most common applications, UTF-16 is less compact, and equally complex to implement, the singular reason to prefer UTF-16 over UTF-8 is if you have a completely closed ecosystem where you are regularly storing or transporting plain text entirely in complex writing systems, without compression.

After compression with zstd or LZMA2, even for 100% Chinese plain text, the advantage is completely wiped out; with gzip the UTF-16 advantage is about 4% on Chinese text with around 3000 unique graphemes.
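If you want to check that against your own text, here is a rough Java harness (my own sketch, gzip only since that is what the standard library ships; the numbers above obviously depend on the corpus, so treat any single file as anecdotal):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

public class CompressedSizeCompare {
    // Returns the gzip-compressed size of the given bytes.
    static int gzipSize(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of any plain-text (UTF-8) file you want to measure.
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);

        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16LE);

        System.out.printf("raw:  UTF-8 %d bytes, UTF-16 %d bytes%n", utf8.length, utf16.length);
        System.out.printf("gzip: UTF-8 %d bytes, UTF-16 %d bytes%n", gzipSize(utf8), gzipSize(utf16));
    }
}
```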

2

For many (most?) applications, you will be dealing only with characters in the Basic Multilingual Plane, so you can treat UTF-16 as a fixed-length encoding.

So you avoid all the complexity of variable-length encodings like UTF-8.
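For genuinely BMP-only data that works; here is a small Java sketch of what goes wrong once a supplementary character slips in (the naive reverse below stands in for "fixed-length" style code, and U+1F600 is just an example of a non-BMP character):

```java
public class FixedLengthAssumption {
    // A "UTF-16 is fixed-length" style reverse: swaps 16-bit units blindly.
    static String naiveReverse(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0, j = chars.length - 1; i < j; i++, j--) {
            char tmp = chars[i];
            chars[i] = chars[j];
            chars[j] = tmp;
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // Fine for BMP-only text...
        System.out.println(naiveReverse("abc")); // "cba"

        // ...but a supplementary character is two char units, and blind
        // swapping leaves its surrogates in the wrong order:
        String s = "a" + new String(Character.toChars(0x1F600));
        String broken = naiveReverse(s);
        System.out.println(Character.isLowSurrogate(broken.charAt(0))); // true -- malformed UTF-16

        // StringBuilder.reverse() keeps surrogate pairs together:
        System.out.println(new StringBuilder(s).reverse().charAt(0) == '\uD83D'); // true
    }
}
```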

Joe
  • +1 in fact I think Unicode version 1 only had the basic, which is why a number of platforms assumed 16-bits would be the right size for a simple character data type. – Daniel Earwicker May 29 '10 at 11:46
  • "I think Unicode version 1 only had the basic" - yes that's true, more details here: http://en.wikipedia.org/wiki/UTF-16/UCS-2 – Joe May 29 '10 at 11:51
  • That's like saying "a lot of programs only care about ASCII, so can treat UTF-8 as a fixed-length encoding." – dan04 Aug 24 '10 at 04:29
  • You certainly cannot treat UTF-16 as a fixed-length encoding. Well, you CAN, but it would be WRONG, and it would FAIL under certain conditions. No text-transform functions are surrogate neutral: computing character length, changing case, swapping characters, reversing strings, etc... all can cause character corruption if surrogate pairs are not accounted for. You cannot simply ignore part of the encoding rules because it's convenient or usually has no side effects. It's just incorrect. – Triynko Apr 13 '11 at 22:43
  • I am constantly being aggravated by new software I've found which made this decision and ends up with all kinds of problems when there's a single non-BMP character in the text. It's especially common in the UCS-2/UTF-16 universe centred around Windows, Java, C#, and JavaScript. – hippietrail May 30 '13 at 01:54