I have a strange validation program that checks whether a UTF-8 string is a valid host name (the Zend Framework Hostname validator in PHP). It allows IDNs (internationalized domain names). It compares each subdomain against sets of characters defined by their hex byte representation. Two such sets are D800-DB7F and DC00-DFFF. PHP's regex matching function, preg_match, fails during these comparisons and reports that DC00-DFFF characters are not allowed in this function. From Wikipedia I learned that these are called surrogate characters in UTF-8. What are they, and which characters do they actually correspond to? I have read about them in several places, but I still don't understand what they are.

Gherman
  • The surrogate code points are used in UTF-16 to represent code points beyond `FFFF`. They are used in pairs, so a character is made of 4 bytes. This mechanism is not needed in UTF-8, so text encoded with UTF-8 shouldn't contain them. However, it's possible to encode the surrogate code points in UTF-8, so it makes sense for a validation routine to identify them. – lenz Jun 24 '18 at 09:10
  • @lenz Can I just say that characters within `DC00-DFFF` are not valid UTF-8 characters? Is that so? Can they appear in a domain name? – Gherman Jun 26 '18 at 09:23
  • Errr... I know they shouldn't be there, but I can't tell you which standard is violated how badly if they are. I don't know about the domain name either, sorry. – lenz Jun 26 '18 at 09:47

1 Answer

What are surrogate characters in UTF-8?

This is almost like a trick question.

Approximate answer #1: 4 bytes (if paired and encoded in UTF-8).

Approximate answer #2: Invalid (if not paired).

Approximate answer #3: It's not UTF-8; it's Modified UTF-8.

Synopsis: The term doesn't apply to UTF-8.

Unicode codepoints have a range that needs 21 bits of data.
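As a quick check of the 21-bit figure, here is a minimal Python sketch. It assumes only that the largest Unicode codepoint is U+10FFFF; the names `MAX_CODE_POINT`, `bits_needed`, and `is_code_point` are made up for the illustration.

```python
# The largest Unicode codepoint is U+10FFFF.
MAX_CODE_POINT = 0x10FFFF

# Number of bits needed to store any codepoint.
bits_needed = MAX_CODE_POINT.bit_length()


def is_code_point(value: int) -> bool:
    """Return True if Python accepts `value` as a codepoint (hypothetical helper)."""
    try:
        chr(value)
    except ValueError:
        return False
    return True


print(bits_needed)
print(is_code_point(MAX_CODE_POINT), is_code_point(MAX_CODE_POINT + 1))
```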

UTF-16 code units are 16 bits. UTF-16 encodes some ranges of Unicode codepoints as one code unit and others as pairs of code units, the first from a "high" range (D800-DBFF), the second from a "low" range (DC00-DFFF). Unicode reserves the codepoints in those two ranges so that they can never be assigned to characters. They are called surrogates, but they are not characters. They don't mean anything by themselves.
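The pairing can be sketched in Python as follows. This is an illustration of the standard UTF-16 arithmetic, not code from the validator in the question; the function name `to_surrogate_pair` is hypothetical, and U+1F600 is just a sample codepoint above FFFF.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a codepoint above U+FFFF into its UTF-16 high/low code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only codepoints above U+FFFF use a surrogate pair")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits go in the high unit
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go in the low unit
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = chr(0x1F600).encode("utf-16-be")
print(encoded.hex())
```

Neither half stands for a character on its own; only the pair maps back to U+1F600.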

UTF-8 code units are 8 bits. UTF-8 encodes several distinct ranges of codepoints in one to four code units, respectively.

#1 It happens that the codepoints that UTF-16 encodes with two 16-bit code units, UTF-8 encodes with four 8-bit code units, and vice versa.
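A small Python sketch makes the correspondence visible. The helper name `unit_counts` and the sample characters are chosen for the illustration only.

```python
from typing import Tuple


def unit_counts(ch: str) -> Tuple[int, int]:
    """Return (UTF-8 code units, UTF-16 code units) for one character."""
    utf8_units = len(ch.encode("utf-8"))            # 8-bit units
    utf16_units = len(ch.encode("utf-16-be")) // 2  # 16-bit units
    return utf8_units, utf16_units


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", unit_counts(ch))
```

Only the last sample, a codepoint above FFFF, needs four UTF-8 code units, and it is also the only one that needs two UTF-16 code units.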

#2 You can mechanically apply the UTF-8 encoding algorithm to the surrogate codepoints, but the resulting bytes are not valid UTF-8. They can't be decoded to a valid codepoint. A compliant reader would throw an exception or discard the bytes and insert a replacement character (�).
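Python's codecs happen to show both behaviours, so here is a hedged sketch. It uses U+DC00, the start of the range from the question; the `surrogatepass` error handler is Python-specific and is used here only to force out the bytes a strict encoder refuses to produce. The helper names are hypothetical.

```python
LONE_SURROGATE = "\udc00"  # first codepoint of the DC00-DFFF range


def strict_encode_fails(text: str) -> bool:
    """True if the strict UTF-8 encoder rejects the text."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def strict_decode_fails(data: bytes) -> bool:
    """True if the strict UTF-8 decoder rejects the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Apply the UTF-8 bit layout to the surrogate anyway.
forced = LONE_SURROGATE.encode("utf-8", "surrogatepass")

print(strict_encode_fails(LONE_SURROGATE))
print(forced.hex())
print(strict_decode_fails(forced))

# A lenient reader substitutes U+FFFD instead of raising.
replaced = forced.decode("utf-8", errors="replace")
print(all(ch == "\ufffd" for ch in replaced))
```

This is likely the same kind of rejection that preg_match reports in the question, though the exact PCRE behaviour is not checked here.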

#3 Java provides a way of implementing functions in external code with a system called JNI. The Java String API provides access to String and char as UTF-16 code units. In certain places in JNI, presumably as a convenience, string values are modified UTF-8. Modified UTF-8 is the UTF-8 encoding algorithm applied to UTF-16 code units instead of Unicode codepoints.
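To make the difference concrete, here is a rough Python sketch of that idea: walk the UTF-16 code units and apply the UTF-8 bit layout to each one. The function name `modified_utf8` is hypothetical and this is not the JNI implementation. The special case for U+0000 (encoded as C0 80) is not mentioned above; it is included because the JNI specification describes it.

```python
def modified_utf8(text: str) -> bytes:
    """Encode text roughly the way Java's Modified UTF-8 does (illustrative sketch)."""
    out = bytearray()
    # Get the UTF-16 code units; surrogatepass lets lone surrogates through.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        if unit == 0:
            # Modified UTF-8 never emits a raw zero byte.
            out += b"\xc0\x80"
        else:
            # Apply the UTF-8 bit layout to the code unit, surrogate or not.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


sample = "\U0001F600"
print(sample.encode("utf-8").hex())   # standard UTF-8: one 4-byte sequence
print(modified_utf8(sample).hex())    # two 3-byte sequences, one per surrogate
```

For text without codepoints above FFFF and without U+0000, the two encodings produce identical bytes, which is why the difference often goes unnoticed.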

Regardless, the fundamental rule of character encodings is to read with the encoding that was used to write. If any sequence of bytes is to be considered text, you must know the encoding; otherwise, you have data loss.
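A short Python sketch of what goes wrong when the reader guesses the encoding. The sample character é and the choice of Latin-1 as the "wrong" encoding are assumptions made for the illustration.

```python
original = "\u00e9"  # é

# Written as UTF-8, read as Latin-1: no error, but the text is wrong.
written_utf8 = original.encode("utf-8")
misread = written_utf8.decode("latin-1")
print(repr(misread))

# Written as Latin-1, read as UTF-8 with replacement: the character is lost.
written_latin1 = original.encode("latin-1")
lossy = written_latin1.decode("utf-8", errors="replace")
print(repr(lossy))

# Reading with the encoding that was used to write round-trips cleanly.
print(written_utf8.decode("utf-8") == original)
```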

Tom Blodget