Are 6 octet UTF-8 sequences valid?

Question

Can UTF-8 encode 5 or 6 byte sequences, allowing all Unicode characters to be encoded? I'm getting conflicting standards. I need to be able to support every Unicode character, not just those in the U+0000..U+10FFFF range.

(All quotes are from RFC 3629)

Section 3:

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number. In a sequence of n octets, n>1, the initial octet has the n higher-order bits set to 1, followed by a bit set to 0. The remaining bit(s) of that octet contain bits from the number of the character to be encoded. The following octet(s) all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded.

So not all possible characters can be encoded with UTF-8? Does this mean I cannot encode characters from different planes than the BMP?

Section 1:

The octet values C0, C1, F5 to FF never appear.

This means we cannot encode UTF-8 values with 5 or 6 octets (or even some with 4 that aren't within the above range)?

Section 12:

Restricted the range of characters to 0000-10FFFF (the UTF-16 accessible range).

Looking at the previous RFC confirms this...they reduced the range of characters.

Section 10:

Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences.

So these sequences are allowed per the ISO/IEC 10646 definition, but not the RFC 3629 definition? Which one should I follow?

Thanks in advance.

score 9 · Accepted Answer · edited Apr 20 '20 at 03:29

There are no Unicode characters beyond U+10FFFF. The BMP is only the first plane and covers U+0000 through U+FFFF.

UTF-8 is well-defined for 0-10FFFF.
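To make the byte-level consequence concrete, here is a minimal sketch of how a decoder could classify the first octet of a sequence under RFC 3629. The function name `utf8_sequence_length` is made up for illustration, and the sketch only looks at the lead byte: a full validator would also have to check the continuation bytes (for example, `F4` may only be followed by `80`..`8F`).

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length announced by a lead byte (hypothetical helper).

    Raises ValueError for octets that can never start a valid sequence
    under RFC 3629: continuation bytes, C0, C1, and F5..FF.
    """
    if 0x00 <= lead <= 0x7F:
        return 1  # 0xxxxxxx: US-ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx (C0 and C1 could only produce overlong forms)
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx, capped at F4 so the result stays <= U+10FFFF
    raise ValueError(f"octet 0x{lead:02X} cannot start a UTF-8 sequence")
```

With this, the old 5- and 6-octet lead bytes (`F8`..`FD`) simply fall into the error branch, while `chr(0x10FFFF).encode("utf-8")` still fits in four octets (`F4 8F BF BF`).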

edited Apr 20 '20 at 03:29 by xmllmx

answered Aug 24 '10 at 17:32 by devio

2

Thanks, that makes sense. Does this mean I only need to worry about UTF-8 sequences of up to 4 octets, with anything longer being an error? – Patrick Niedzielski Aug 24 '10 at 20:23
1

@PatrickNiedzielski Yes, but you must treat them as an error (`MUST`). – EKons Aug 27 '16 at 17:37
1

@devio, What about in future versions of Unicode when they expand it? – Pacerier Mar 20 '17 at 09:20
1

Planes 3–13 are still unassigned. I guess we should not worry ;) https://en.wikipedia.org/wiki/Plane_(Unicode) – devio Mar 20 '17 at 10:06

score 2 · Answer 2 · edited Aug 25 '10 at 00:58

Both UTF-8 and UTF-16 allow all Unicode characters to be encoded. What UTF-8 is not allowed to do is encode the high and low surrogate halves (U+D800..U+DFFF, which UTF-16 uses to form pairs) or values above U+10FFFF, which aren't legal Unicode.

Note that the BMP ends at U+FFFF.
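You can check these rules quickly with Python's built-in codec, which follows the RFC 3629 range. This is only a sketch: `can_encode_utf8` is a name I made up, and it relies on the documented behaviour that `chr()` rejects values above 0x10FFFF and that the strict `utf-8` codec refuses lone surrogates.

```python
def can_encode_utf8(code_point: int) -> bool:
    """Report whether Python's strict UTF-8 codec accepts this code point.

    Hypothetical helper for illustration only.
    """
    try:
        chr(code_point).encode("utf-8")
    except (ValueError, OverflowError):
        # ValueError covers chr() range errors and UnicodeEncodeError
        # (raised for surrogates); OverflowError covers huge integers.
        return False
    return True


# Characters outside the BMP are fine; they take four octets.
assert can_encode_utf8(0x1F600)
assert len(chr(0x1F600).encode("utf-8")) == 4

# Surrogate halves and anything past U+10FFFF are refused.
assert not can_encode_utf8(0xD800)
assert not can_encode_utf8(0x110000)
```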

edited Aug 25 '10 at 00:58 by dan04

answered Aug 24 '10 at 17:36 by chryss

score 2 · Answer 3 · answered Aug 31 '10 at 03:07

I would have to say no: Unicode code points are valid for the range [0, 0x10FFFF], and those map to 1-4 octets. So, if you did come across a 5- or 6-octet UTF-8 encoded code point, it's not a valid code point - there's certainly nothing assigned there. I am a little baffled as to why they're there in the ISO standard - I couldn't find an explanation.

It does make you wonder, however, whether they might someday expand past U+10FFFF. The range up to 0x10FFFF allows for over a million characters, but there are a lot of characters out there, and it would depend on how much eventually gets encoded. (For sanity's sake, let's hope not; a million characters is a lot!) UTF-32 could handle more code points, and as you've discovered, so could UTF-8. It'd really be UTF-16 that's out of luck: more surrogate pairs would be needed somewhere in the spectrum of code points.
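For the curious, here is a rough sketch of what the older, longer scheme looks like when you extend the same bit pattern up to 6 octets (the layout described in RFC 2279 and the ISO text quoted in the question). `encode_legacy_utf8` is a name I invented, it is for illustration only, and it deliberately does not reject surrogates. Nothing conforming to RFC 3629 should emit or accept its 5- and 6-octet output.

```python
def encode_legacy_utf8(cp: int) -> bytes:
    """Encode with the pre-RFC 3629 bit layout, up to 6 octets (illustrative)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if cp < 0x80:
        return bytes([cp])
    # (payload bits available, sequence length, lead-byte marker)
    layouts = ((11, 2, 0xC0), (16, 3, 0xE0), (21, 4, 0xF0),
               (26, 5, 0xF8), (31, 6, 0xFC))
    for bits, length, marker in layouts:
        if cp < (1 << bits):
            break
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))  # 10xxxxxx continuation octet
        cp >>= 6
    return bytes([marker | cp] + tail[::-1])


# Inside the Unicode range it agrees with the modern encoding...
assert encode_legacy_utf8(0x10FFFF) == chr(0x10FFFF).encode("utf-8")

# ...but past it you get 5 or 6 octets that a strict decoder rejects.
five = encode_legacy_utf8(0x200000)
assert len(five) == 5
try:
    five.decode("utf-8")
except UnicodeDecodeError:
    pass  # expected: the lead byte F8 is invalid under RFC 3629
```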

answered Aug 31 '10 at 03:07 by Thanatos

2

The ISO had originally intended to introduce their own 31-bit character encoding. UTF-8 was designed around that possibility. – dan04 Aug 31 '10 at 03:39
2

To me, it seems Unicode is trying to fill up the rest of the codepoints...that they have more than they know what to do with. Example: there is a block for Mahjong playing blocks. However, there certainly are some useful characters outside the BMP that I need to support. Most of them are rubbish, though. It makes me wonder why they didn't accept Klingon characters a while back. – Patrick Niedzielski Sep 01 '10 at 15:22
@dan04: Quite so. That’s why you can have abstract characters at much higher code points than 0x10_FFFF if you aren’t using them for UTF interchange. (Sometimes these are called *supers* or *supras*.) For example, `perl -le 'print ord chr(0xFFF_FFFF_FFFF)'` prints `17592186044415`. This can be quite handy. – tchrist Feb 13 '11 at 19:30
