I was reading a few questions on SO about Unicode and there were some comments I didn't fully understand, like this one:
Dean Harding: UTF-8 is a variable-length encoding, which is more complex to process than a fixed-length encoding. Also, see my comments on Gumbo's answer: basically, combining characters exist in all encodings (UTF-8, UTF-16 & UTF-32) and they require special handling. You can use the same special handling that you use for combining characters to also handle surrogate pairs in UTF-16, so for the most part you can ignore surrogates and treat UTF-16 just like a fixed encoding.
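To check I understood the combining-characters part, I put together a quick Java sketch (Java seemed natural since its strings are sequences of UTF-16 code units; this snippet is my own illustration, not from the original answers). The same visible "é" can be one code point or two, and that's true no matter which UTF you encode it to:

```java
import java.text.Normalizer;

public class CombiningDemo {
    public static void main(String[] args) {
        String precomposed = "\u00E9";  // é as a single code point, U+00E9
        String combining   = "e\u0301"; // e + COMBINING ACUTE ACCENT, U+0301

        // Both display as "é", but they differ at the code-point level,
        // so UTF-8, UTF-16 and UTF-32 all carry two units for the second form.
        System.out.println(precomposed.codePointCount(0, precomposed.length())); // 1
        System.out.println(combining.codePointCount(0, combining.length()));     // 2
        System.out.println(precomposed.equals(combining));                       // false

        // Normalization is the kind of "special handling" that's needed
        // in every encoding, fixed-width or not.
        String nfc = Normalizer.normalize(combining, Normalizer.Form.NFC);
        System.out.println(precomposed.equals(nfc));                             // true
    }
}
```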
I'm a little confused by the last part of that comment ("for the most part"). If UTF-16 is treated as a fixed 16-bit encoding, what issues could that cause? What are the chances of encountering characters outside the BMP? And if they do occur, what goes wrong if you've assumed every character is two bytes?
I read the Wikipedia article on surrogate pairs, but it didn't really make things any clearer to me!
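So I tried a little experiment myself in Java (again my own snippet, since Java chars are UTF-16 code units it's a convenient way to poke at this), using U+1D11E MUSICAL SYMBOL G CLEF, which sits outside the BMP:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D11E is outside the BMP, so UTF-16 encodes it
        // as the surrogate pair D834 DD1E.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(clef.length());                         // 2 code units
        System.out.println(clef.codePointCount(0, clef.length())); // 1 code point

        // Under the "fixed 16-bit" assumption, charAt(0) is treated as
        // a whole character, but it's really half of one.
        char first = clef.charAt(0);
        System.out.println(Character.isHighSurrogate(first));      // true

        // Cutting on a code-unit boundary splits the pair and leaves
        // an ill-formed string (a lone surrogate).
        String broken = clef.substring(0, 1);
        System.out.println(broken.codePointAt(0) == 0xD834);       // true: half a clef
    }
}
```

So length counts, indexing and slicing all silently go wrong for non-BMP characters, which I suppose is the "special handling" being referred to.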
Edit: I guess what I really mean is "Why would anyone suggest treating UTF-16 as a fixed-width encoding when it seems bogus?"
Edit2:
I found another comment in "Is there any reason to prefer UTF-16 over UTF-8?" which I think explains this a little better:
Andrew Russell: For performance: UTF-8 is much harder to decode than UTF-16. In UTF-16 characters are either a Basic Multilingual Plane character (2 bytes) or a Surrogate Pair (4 bytes). UTF-8 characters can be anywhere between 1 and 4 bytes
This suggests the point being made was that UTF-16 never has three-byte characters: every code unit is exactly two bytes, so even if you assume 16 bits per character you can't end up one byte off and "totally screw up" the rest of the stream. But I'm still not convinced this is any different from assuming UTF-8 is all single-byte characters!
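To make the comparison I have in mind concrete, here's a rough Java sketch of the two naive assumptions (again just my own illustration): a 16-bit scan of UTF-16 gets every BMP character right, and surrogate units announce themselves, so the scan never falls out of alignment; a one-byte scan of UTF-8 garbles every non-ASCII character, including a plain "é":

```java
import java.nio.charset.StandardCharsets;

public class NaiveScanDemo {
    public static void main(String[] args) {
        String s = "h\u00E9 \uD834\uDD1E"; // "hé " plus U+1D11E (non-BMP)

        // Naive "fixed 16-bit" scan of UTF-16: every BMP character comes
        // out right, and the rare surrogate units are self-identifying.
        for (char unit : s.toCharArray()) {
            if (Character.isSurrogate(unit)) {
                System.out.println("surrogate unit: 0x" + Integer.toHexString(unit));
            } else {
                System.out.println("character: " + unit);
            }
        }

        // Naive "fixed 8-bit" scan of UTF-8: even the ordinary é is
        // misread, because it occupies two bytes (0xC3 0xA9).
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0x80) != 0) {
                System.out.println("fragment byte: 0x" + Integer.toHexString(b & 0xFF));
            } else {
                System.out.println("character: " + (char) b);
            }
        }
    }
}
```

Though I realize UTF-8's lead and continuation bytes are also distinguishable, so maybe the difference is just how often the naive assumption is wrong rather than anything fundamental?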