8

I have a need to manipulate UTF-8 byte arrays in a low-level environment. The strings will be prefix-similar and kept in a container that exploits this (a trie.) To preserve this prefix-similarity as much as possible, I'd prefer to use a terminator at the end of my byte arrays, rather than (say) a byte-length prefix.

What terminator should I use? It seems 0xff is an illegal byte in every position of any UTF-8 string, but can someone confirm this concretely?
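To make the intent concrete, here is a minimal sketch in C (not part of the original post; `make_key` is a placeholder name) of building a terminated key before inserting it into the trie. The sentinel value passed in is exactly what the question is asking about.

```
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated buffer of len + 1 bytes: the UTF-8 bytes
 * followed by a single sentinel byte.  The caller frees the buffer. */
static uint8_t *make_key(const uint8_t *utf8, size_t len, uint8_t sentinel)
{
    uint8_t *key = malloc(len + 1);
    if (!key)
        return NULL;
    memcpy(key, utf8, len);
    key[len] = sentinel;  /* e.g. 0xFF, if it can truly never occur in UTF-8 */
    return key;
}
```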

phs

3 Answers

6

0xFF and 0xFE cannot appear in legal UTF-8 data. The bytes 0xF8-0xFD can appear only in the obsolete version of UTF-8 that allowed sequences of up to six bytes.

0x00 is legal but won't appear anywhere except in the encoding of U+0000. This is exactly the same as in other encodings, and the fact that it's legal in all of them never stopped it from being used as a terminator in C strings. I'd probably go with 0x00.
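A minimal sketch of that suggestion (C assumed; the helper name is made up): scanning for a sentinel byte works the same way whichever sentinel is chosen. With 0x00 it is ordinary C-string handling, with the caveat that the stored text must never contain U+0000.

```
#include <stddef.h>
#include <stdint.h>

/* Returns the number of UTF-8 bytes stored before the terminator.
 * With sentinel == 0x00 this is equivalent to strlen(). */
static size_t key_length(const uint8_t *key, uint8_t sentinel)
{
    size_t n = 0;
    while (key[n] != sentinel)
        n++;
    return n;
}
```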

bames53
6

The byte 0xff cannot appear in a valid UTF-8 sequence, nor can any of 0xfc, 0xfd, 0xfe.

All UTF-8 bytes must match one of

0xxxxxxx - Single byte, the lower 7 bits (ASCII).
10xxxxxx - Second and subsequent bytes in a multi-byte sequence.
110xxxxx - First byte of a two-byte sequence.
1110xxxx - First byte of a three-byte sequence.
11110xxx - First byte of a four-byte sequence.
111110xx - First byte of a five-byte sequence.
1111110x - First byte of a six-byte sequence.

There are no sequences of seven or more bytes. The latest version of UTF-8 only allows sequences up to 4 bytes in length, which leaves 0xf8-0xff unused, but it is possible that a byte sequence could be validly called UTF-8 according to an obsolete version and include octets in 0xf8-0xfb.
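A small sketch of the patterns above (C assumed, not part of the original answer): classifying a lead byte by those bit masks, restricted to the 4-byte maximum of RFC 3629.

```
#include <stdint.h>

/* Returns the expected sequence length (1-4) for a UTF-8 lead byte per
 * RFC 3629, or 0 for a continuation byte or for a byte (0xF8-0xFF) that
 * cannot start a sequence at all. */
static int utf8_sequence_length(uint8_t b)
{
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;  /* 10xxxxxx continuation, or 0xF8-0xFF: never a lead byte */
}
```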

Mike Samuel
  • Modern UTF-8 standards do not allow for 5-byte and 6-byte sequences anymore, as they encode codepoints that cannot be represented in UTF-16. RFC 3629 limited the max byte sequence to 4, and the Unicode standard adopted that limitation. – Remy Lebeau Jan 19 '12 at 01:36
  • @Remy Lebeau, I think you are confusing UTF-8 with [CESU-8](http://www.unicode.org/reports/tr26/). "CESU-8 defines an encoding scheme for Unicode identical to UTF-8 except for its representation of supplementary characters. In CESU-8, supplementary characters are represented as six-byte sequences resulting from the transformation of each UTF-16 surrogate code unit into an eight-bit form similar to the UTF-8 transformation, but without first converting the input surrogate pairs to a scalar value." UTF-8 has not changed. – Mike Samuel Jan 19 '12 at 03:21
  • @RemyLebeau, or are you referring to the [RFC 3629](http://tools.ietf.org/html/rfc3629#section-5) update " Changes from RFC 2279: Restricted the range of characters to 0000-10FFFF (the UTF-16 accessible range)"? – Mike Samuel Jan 19 '12 at 03:30
  • Yes, that is what I am referring to. Neither RFC 3629 nor the official Unicode standard allow codepoints above U+10FFFF to be used with UTF-8, which means you can never have a valid UTF-8 sequence that is more than 4 bytes in length. – Remy Lebeau Jan 19 '12 at 23:11
  • @RemyLebeau-TeamB, Edited to add caveat. – Mike Samuel Jan 19 '12 at 23:44
  • @Anony-Mousse, it appears in UTF-8 just fine as the encoding for NUL. Java's UTF-8 is a variant which uses the 2-byte form for NUL, but that is not standard. – Mike Samuel Jan 20 '12 at 14:12
  • @Anony-Mousse, No. The byte 0 can appear in a valid UTF-8 byte string, so should not be used as an out-of-band separator. – Mike Samuel Jan 20 '12 at 14:54
  • @Anony-Mousse, the OPer wants to be able to mark where a sequence ends. That requires an out-of-band terminator. There is no in-band separator/terminator for UTF-8. – Mike Samuel Jan 20 '12 at 15:17
  • @Anony-Mousse, try writing code to correctly find the end of a UTF-8 byte sequence that is terminated with NUL. In python, `find_end("%s\x00" % s.encode("UTF-8")) == len(s)` for all unicode strings `s`. When you understand why that cannot be done, you will understand why it has to be out-of-band. – Mike Samuel Jan 20 '12 at 15:29
  • @Anony-Mousse, re "I cannot see this requirement", see his comment "\0 is also a legal ASCII encoding, and so a legal UTF-8 encoding of a code point. I wanted something explicitly not legal." – Mike Samuel Jan 20 '12 at 15:30
  • @Anony-Mousse, Why is what you care about important? If the problem specifies "valid UTF-8" the most robust design is one that assumes nothing above and beyond "valid UTF-8". Assuming "valid UTF-8" plus your favorite unstated assumptions is going to lead to brittle code. – Mike Samuel Jan 20 '12 at 15:51
  • @Anony-Mousse, Exactly. If what you are storing includes values that are not valid C strings, e.g. `"foo\0bar\0"`, then you do not get a robust design by assuming that you are storing C strings. – Mike Samuel Jan 20 '12 at 16:22
  • @Anony-Mousse, true. And storing only sequences of lower-case ASCII letters and numbers is even less error prone when anyone who might touch your data might be confused as to encoding. But neither assumption is justified when your job is to store and compare UTF-8 byte sequences. – Mike Samuel Jan 20 '12 at 16:34
0

What about using one of the control characters (which encode as single bytes in UTF-8)?

You can choose one from http://www.utf8-chartable.de/
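For illustration only (C assumed; the function and constant names are made up): a control character such as U+001F encodes as the single byte 0x1F, so it can be appended as a terminator. Note it is still valid UTF-8 and therefore in-band; this only works if the stored text is guaranteed never to contain it.

```
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TERMINATOR 0x1F  /* U+001F UNIT SEPARATOR, chosen for illustration */

/* Copies the UTF-8 bytes into buf (which must hold len + 1 bytes) and
 * appends the terminator; returns the total key length. */
static size_t terminate_key(uint8_t *buf, const uint8_t *utf8, size_t len)
{
    memcpy(buf, utf8, len);
    buf[len] = TERMINATOR;
    return len + 1;
}
```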

Ahmed Al Hafoudh