
I'm trying to grasp UTF-8 encoding. The encoding of a code point can range from 1 to 4 bytes. There are only 128 characters that encode to one byte. But since a byte has 8 bits, it could encode 256 (= 2^8) characters.

So why does UTF-8 encode only 128 characters in one byte, and not 256? I read the Wikipedia article.

AmigoJack
Piet Pro
  • If you use all 256 possible byte values to indicate a one-byte sequence, what byte value would be used to indicate a multi-byte sequence? – Mark Tolonen Mar 07 '23 at 23:40
  • Using all 8 bits to encode 256 different characters is what all the old fixed-width encodings do. But they are limited to at most 256 different characters because of this, since there are no values left to indicate "this byte starts a multi-byte sequence". – Joachim Sauer Mar 08 '23 at 15:27
  • "*I read Wikipedia*" - if you actually had, then you would have already known the answer to your question, as it clearly explains how the bit patterns work in every byte of a UTF-8 sequence. Some bits are reserved for UTF-8's encoding; the remaining bits hold the codepoint value. – Remy Lebeau Mar 22 '23 at 00:23

1 Answer

One bit is reserved to distinguish single-byte characters from multi-byte sequences. Giving up a single bit out of 8 leaves 2^7 = 128 possible one-byte characters.

The number of leading 1 bits in the first byte indicates the length of the sequence. A leading 0 bit indicates a one-byte sequence, but that 0 means only 7 bits are left for the code point data.
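This is easy to verify in Python; 'A' (U+0041) serves here as an arbitrary ASCII example:

```python
# ASCII characters occupy one byte whose top bit is 0,
# leaving 7 data bits -- hence the 2**7 = 128 one-byte characters.
encoded = "A".encode("utf-8")        # 'A' is U+0041
assert len(encoded) == 1             # a single byte
bits = format(encoded[0], "08b")
assert bits == "01000001"            # leading 0, then the 7-bit code point
```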

A leading 110 indicates a two-byte sequence. Continuation bytes begin with 10. So a two-byte sequence can encode 11 bits (16-3-2).
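For example, 'é' (U+00E9) needs 8 bits, so it gets a two-byte sequence; Python lets us check the prefixes and reassemble the 11 data bits:

```python
# U+00E9 needs more than 7 bits, so UTF-8 uses 110xxxxx 10xxxxxx,
# giving 5 + 6 = 11 data bits.
b1, b2 = "é".encode("utf-8")                 # bytes 0xC3 0xA9
assert format(b1, "08b").startswith("110")   # lead byte
assert format(b2, "08b").startswith("10")    # continuation byte
# Strip the prefixes and recombine the data bits to recover the code point:
assert ((b1 & 0b00011111) << 6) | (b2 & 0b00111111) == 0xE9
```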

A leading 1110 indicates a three-byte sequence which encodes 16 (24-4-2*2) bits.
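The euro sign '€' (U+20AC) is a handy example of a three-byte sequence:

```python
# U+20AC needs 14 bits, so UTF-8 uses 1110xxxx 10xxxxxx 10xxxxxx,
# giving 4 + 6 + 6 = 16 data bits.
b1, b2, b3 = "€".encode("utf-8")             # bytes 0xE2 0x82 0xAC
assert format(b1, "08b").startswith("1110")  # lead byte
# Strip the prefixes and recombine to recover the code point:
cp = ((b1 & 0b00001111) << 12) | ((b2 & 0b00111111) << 6) | (b3 & 0b00111111)
assert cp == 0x20AC
```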

And a leading 11110 indicates a four-byte sequence encoding 21 (32-5-2*3) bits. Unicode code points are defined as 21 bits, so that's enough. UTF-8 originally supported up to 6-byte sequences, but was restricted to four bytes when Unicode was restricted to 21 bits (to stay compatible with UTF-16).
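An emoji outside the Basic Multilingual Plane, such as '😀' (U+1F600), exercises the four-byte form:

```python
# U+1F600 needs 17 bits, so UTF-8 uses 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx,
# giving 3 + 6 + 6 + 6 = 21 data bits.
b1, b2, b3, b4 = "😀".encode("utf-8")         # bytes 0xF0 0x9F 0x98 0x80
assert format(b1, "08b").startswith("11110")  # lead byte
# Strip the prefixes and recombine to recover the code point:
cp = ((b1 & 0b00000111) << 18) | ((b2 & 0b00111111) << 12) \
     | ((b3 & 0b00111111) << 6) | (b4 & 0b00111111)
assert cp == 0x1F600
```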

You may notice that one-byte and continuation bytes are "backwards." To be consistent, it would make sense for single-byte sequences to begin with 10 and continuation bytes to begin with 0. But this would break ASCII compatibility (which is a huge advantage of UTF-8) and also would reduce the number of one-byte encodings to 64, which would be extremely inefficient. So a small inconsistency is accepted to great advantage.
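The ASCII compatibility this preserves can be demonstrated directly: any pure-ASCII string produces byte-for-byte identical output under both encodings.

```python
# Because ASCII bytes (leading 0) are reused unchanged, a pure-ASCII
# string has the same byte sequence in ASCII and in UTF-8.
s = "Hello, world!"
assert s.encode("ascii") == s.encode("utf-8")
```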

Remy Lebeau
Rob Napier