
According to the Wikipedia article on UTF-16, "...[UTF-16] is also the only web-encoding incompatible with ASCII." (at the end of the lead section). This statement refers to the HTML Standard. Is this statement wrong?

I'm mainly a C# / .NET dev, and both the .NET Framework and .NET Core use UTF-16 internally to represent strings. I'm pretty certain that UTF-16 is a superset of ASCII, as I can easily write code that displays all ASCII characters:

public static void Main()
{
    // Cast each 7-bit ASCII value (0-127) to a UTF-16 char and print it
    for (byte currentAsciiCharacter = 0; currentAsciiCharacter < 128; currentAsciiCharacter++)
    {
        Console.WriteLine($"ASCII character {currentAsciiCharacter}: \"{(char) currentAsciiCharacter}\"");
    }
}

Sure, the control characters will mess up the console output, but I think my point is clear: the lower 7 bits of a 16-bit char hold the corresponding ASCII code point, while the upper 9 bits are zero. Thus UTF-16 should be a superset of ASCII in .NET.
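To make that concrete, here is a minimal sketch of the bit layout (my own illustration, not part of the original question; the lines are meant to be dropped into the Main method above):

    char asciiA = 'A';
    // The numeric value of the UTF-16 code unit equals the ASCII code point: 65
    Console.WriteLine((int) asciiA);
    // The full 16-bit pattern with the upper 9 bits zero: 0000000001000001
    Console.WriteLine(Convert.ToString((int) asciiA, 2).PadLeft(16, '0'));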

I tried to find out why the HTML Standard says that UTF-16 is incompatible with ASCII, but it seems like they simply define it that way:

An ASCII-compatible encoding is any encoding that is not a UTF-16 encoding.

I couldn't find any explanation in their spec of why UTF-16 is not compatible.

My detailed questions are:

  1. Is UTF-16 actually compatible with ASCII? Or did I miss something here?
  2. If it is compatible, why does the HTML Standard say it's not? Maybe because of byte ordering? (See the byte-level sketch below.)
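As a byte-level sketch of question 2 (my own illustration, not from the original post, using the standard System.Text encodings): the same ASCII-only string yields different byte sequences under UTF-16LE and UTF-16BE, and neither matches the single-byte ASCII output.

using System;
using System.Text;

public static class ByteOrderDemo
{
    public static void Main()
    {
        string text = "Hi";
        // ASCII: one byte per character
        Console.WriteLine(BitConverter.ToString(Encoding.ASCII.GetBytes(text)));            // 48-69
        // UTF-16LE (Encoding.Unicode): low byte first
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(text)));          // 48-00-69-00
        // UTF-16BE: high byte first
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(text))); // 00-48-00-69
    }
}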
  • *UTF-16* is a 2-byte *encoding* and is **not** a superset of ASCII. *Unicode* shares its first 128 code points with ASCII. *UTF-8* is compatible with ASCII when only ASCII characters are used; otherwise it encodes Unicode into multiple bytes (that are not 7-bit clean). https://en.wikipedia.org/wiki/Unicode – user2864740 May 17 '20 at 07:09
  • But the standard ASCII encoding is only 128 code points (7 bits)? Couldn't I say that these are completely part of UTF-16, and thus the latter is a superset? Sure, there are extended ASCII encodings, but I want to leave them out of the discussion. – feO2x May 17 '20 at 07:14
  • Unicode != UTF-16. Unicode has multiple encodings: UTF-8, UTF-16 (LE/BE/Java), UTF-32, SCSU... which can *encode* Unicode. UTF-16 is a multibyte encoding and is *not* compatible with single-byte ASCII. A non-Unicode-aware program will, at best, display a NUL character between all encoded ASCII-range characters. – user2864740 May 17 '20 at 07:14
  • I know that Unicode provides the code points and the different encodings specify how these code points are stored as bytes. My actual question is whether the lower 7 bits of a UTF-16 character are encoded differently than the corresponding bits in ASCII. Sometimes it's not that easy to phrase the right question... – feO2x May 17 '20 at 07:20
  • When a C# program writes a Unicode string (stored as UTF-16 in memory) to a file or stream, it writes the correct sequence of bytes per the TARGET encoding of the stream. This is why a C# program (UTF-16 in-memory strings) can write a UTF-8 file. – user2864740 May 17 '20 at 07:23
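A minimal sketch of that last comment (my own example with hypothetical file names, not from the original thread): the in-memory string is always UTF-16, but the bytes on disk depend on the encoding handed to the writer.

using System.IO;
using System.Text;

public static class WriterDemo
{
    public static void Main()
    {
        string text = "Hello";
        // 5 bytes on disk, identical to ASCII (no BOM)
        File.WriteAllText("utf8.txt", text, new UTF8Encoding(false));
        // BOM (FF FE) followed by 10 bytes of UTF-16LE
        File.WriteAllText("utf16.txt", text, Encoding.Unicode);
    }
}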

1 Answer


ASCII is a 7-bit encoding stored in single bytes. UTF-16 uses 2-byte code units, which makes it incompatible right away. UTF-8 uses single-byte units for the ASCII range, so it matches ASCII there. In other words, UTF-8 is designed to be backward compatible with ASCII.
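To see this in the asker's own environment, a small sketch (my addition, not part of the original answer) comparing the raw bytes the three encodings produce for an ASCII-only string:

using System;
using System.Text;

public static class EncodingComparison
{
    public static void Main()
    {
        string text = "ABC";
        // ASCII and UTF-8 produce identical bytes for ASCII-only input
        Console.WriteLine(BitConverter.ToString(Encoding.ASCII.GetBytes(text)));   // 41-42-43
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(text)));    // 41-42-43
        // UTF-16LE interleaves a zero byte after each ASCII character
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(text))); // 41-00-42-00-43-00
    }
}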

kofemann
  • Thanks for pointing out the discrepancy between one-byte and two-byte length. But according to this argument, UTF-32 should also be incompatible? – feO2x May 17 '20 at 07:17
  • The quote is limited to "web encodings", so any encoding that is not a "web encoding" does not apply. The list of "web encodings", per the article, needs to be specified/referenced in order to remove ambiguity. – user2864740 May 17 '20 at 07:27