
According to the Wikipedia article on UTF-16, "...[UTF-16] is also the only web-encoding incompatible with ASCII." (at the end of the lead section). This statement refers to the HTML Standard. Is this statement wrong?

I'm mainly a C# / .NET dev, and both the .NET Framework and .NET Core use UTF-16 internally to represent strings. I'm pretty certain that UTF-16 is a superset of ASCII, as I can easily write code that displays all ASCII characters:

public static void Main()
{
    // Cast each 7-bit ASCII value (0-127) to a UTF-16 char and print it
    for (byte currentAsciiCharacter = 0; currentAsciiCharacter < 128; currentAsciiCharacter++)
    {
        Console.WriteLine($"ASCII character {currentAsciiCharacter}: \"{(char) currentAsciiCharacter}\"");
    }
}

Sure, the control characters will mess up the console output, but I think my point is clear: the lower 7 bits of a 16-bit char hold the corresponding ASCII code point, while the upper 9 bits are zero. Thus UTF-16 should be a superset of ASCII in .NET.
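To make that concrete, here is a minimal sketch of the bit layout (my own illustration, not part of the original question; the lines are meant to be dropped into the Main method above):

    char asciiA = 'A';
    // The numeric value of the UTF-16 code unit equals the ASCII code point: 65
    Console.WriteLine((int) asciiA);
    // The full 16-bit pattern with the upper 9 bits zero: 0000000001000001
    Console.WriteLine(Convert.ToString((int) asciiA, 2).PadLeft(16, '0'));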

I tried to find out why the HTML Standard says that UTF-16 is incompatible with ASCII, but it seems like they simply define it that way:

An ASCII-compatible encoding is any encoding that is not a UTF-16 encoding.

I couldn't find any explanation in their spec of why UTF-16 is not compatible.

My detailed questions are:

  1. Is UTF-16 actually compatible with ASCII? Or did I miss something here?
  2. If it is compatible, why does the HTML Standard say it's not? Maybe because of byte ordering? (See the byte-level sketch below.)
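As a byte-level sketch of question 2 (my own illustration, not from the original post, using the standard System.Text encodings): the same ASCII-only string yields different byte sequences under UTF-16LE and UTF-16BE, and neither matches the single-byte ASCII output.

using System;
using System.Text;

public static class ByteOrderDemo
{
    public static void Main()
    {
        string text = "Hi";
        // ASCII: one byte per character
        Console.WriteLine(BitConverter.ToString(Encoding.ASCII.GetBytes(text)));            // 48-69
        // UTF-16LE (Encoding.Unicode): low byte first
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(text)));          // 48-00-69-00
        // UTF-16BE: high byte first
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(text))); // 00-48-00-69
    }
}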
  • *UTF-16* is a 2-byte *encoding* and is **not** a superset of ASCII. *Unicode* shares its first 128 code points with ASCII. *UTF-8* is compatible with ASCII when only ASCII characters are used; otherwise it encodes Unicode into multiple bytes (that are not 7-bit clean). https://en.wikipedia.org/wiki/Unicode – user2864740 May 17 '20 at 07:09
  • But the standard ASCII encoding is only 128 code points (7 bits)? Couldn't I say that these are completely part of UTF-16, and thus the latter is a superset? Sure, there are extended ASCII encodings, but I want to leave them out of the discussion. – feO2x May 17 '20 at 07:14
  • Unicode != UTF-16. Unicode has multiple encodings: UTF-8, UTF-16 (LE/BE/Java), UTF-32, SCSU... which can *encode* Unicode. UTF-16 is a multibyte encoding and is *not* compatible with single-byte ASCII. A non-Unicode-aware program will, at best, display a NUL character between all encoded ASCII-range characters. – user2864740 May 17 '20 at 07:14
  • I know that Unicode provides the code points and the different encodings specify how these code points are stored as bytes. My actual question is whether the lower 7 bits of a UTF-16 character are encoded differently than the corresponding bits in ASCII. Sometimes it's not that easy to phrase the right question... – feO2x May 17 '20 at 07:20
  • When a C# program writes a Unicode string (stored as UTF-16 in memory) to a file or stream, it writes the correct sequence of bytes per the TARGET encoding of the stream. This is why a C# program (UTF-16 in-memory strings) can write a UTF-8 file. – user2864740 May 17 '20 at 07:23
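A minimal sketch of that last comment (my own example with hypothetical file names, not from the original thread): the in-memory string is always UTF-16, but the bytes on disk depend on the encoding handed to the writer.

using System.IO;
using System.Text;

public static class WriterDemo
{
    public static void Main()
    {
        string text = "Hello";
        // 5 bytes on disk, identical to ASCII (no BOM)
        File.WriteAllText("utf8.txt", text, new UTF8Encoding(false));
        // BOM (FF FE) followed by 10 bytes of UTF-16LE
        File.WriteAllText("utf16.txt", text, Encoding.Unicode);
    }
}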

1 Answer


ASCII is a 7-bit encoding stored in single bytes. UTF-16 uses 2-byte code units, which makes it incompatible right away. UTF-8 uses single-byte units for the ASCII range, so it matches ASCII there. In other words, UTF-8 is designed to be backward compatible with ASCII.
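To see this in the asker's own environment, a small sketch (my addition, not part of the original answer) comparing the raw bytes the three encodings produce for an ASCII-only string:

using System;
using System.Text;

public static class EncodingComparison
{
    public static void Main()
    {
        string text = "ABC";
        // ASCII and UTF-8 produce identical bytes for ASCII-only input
        Console.WriteLine(BitConverter.ToString(Encoding.ASCII.GetBytes(text)));   // 41-42-43
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(text)));    // 41-42-43
        // UTF-16LE interleaves a zero byte after each ASCII character
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(text))); // 41-00-42-00-43-00
    }
}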

kofemann
  • Thanks for pointing out the discrepancy between one-byte and two-byte length. But according to this argument, UTF-32 should also be incompatible? – feO2x May 17 '20 at 07:17
  • The quote is limited to "web encodings", so any encoding that is not a "web encoding" does not apply. The list of "web encodings", per the article, needs to be specified/referenced in order to remove ambiguity. – user2864740 May 17 '20 at 07:27