JSON, Unicode: a way to detect that XXXX in \uXXXX does not correspond to a Unicode character?

Question

The JSON specification says that a character may be escaped using this notation: \uXXXX (where XXXX are four hex digits).

However, not every set of four hex digits corresponds to a Unicode character.

Are there tools that can scan a JSON document to detect the presence of \uXXXX, where XXXX does not correspond to any Unicode character? More generally, how does one determine that \uXXXX does not correspond to any Unicode character?

score 0 · Answer 1 · edited Oct 07 '21 at 13:39

When the JSON spec talks about Unicode characters, it really means Unicode codepoints. Every \uXXXX sequence with four hex digits represents a valid codepoint: \uXXXX covers U+0000 through U+FFFF, and Unicode defines codepoints all the way up to U+10FFFF.

When not using escaped hex notation, the full range of Unicode codepoints can be used as-is in JSON. On the other hand, a single \uXXXX escape can only express codepoints up to U+FFFF. This is OK though, because codepoints above U+FFFF can be escaped as a UTF-16 surrogate pair: two codepoints (a high surrogate in U+D800 through U+DBFF followed by a low surrogate in U+DC00 through U+DFFF) that each fit in the \uXXXX range and act together. This is described in RFC 7159 Section 7 Strings:

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A though F can be upper or lower case. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

...

To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a 12-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
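As a rough illustration of the arithmetic behind that 12-character form, here is a minimal Python sketch. The function names `to_surrogate_pair` and `json_escape` are made up for this example and are not part of any library; the calculation itself is the standard UTF-16 one (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves).

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a codepoint above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("codepoint must be in U+10000..U+10FFFF")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def json_escape(code_point: int) -> str:
    """Return the \\uXXXX form (6 or 12 characters) for one codepoint."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    if code_point <= 0xFFFF:
        return f"\\u{code_point:04X}"
    high, low = to_surrogate_pair(code_point)
    return f"\\u{high:04X}\\u{low:04X}"


print(json_escape(0x005C))   # reverse solidus
print(json_escape(0x1D11E))  # G clef
```

For the two examples in the RFC text, this prints `\u005C` and `\uD834\uDD1E`.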

So your question should not be "does \uXXXX correspond to a Unicode character?", because it logically always will, as all values 0x0000 - 0xFFFF are valid Unicode codepoints. The real question should be "does \uXXXX correspond to a non-surrogate codepoint in the BMP, and if not, does it belong to a \uXXXX\uXXXX pair that forms a valid UTF-16 surrogate pair?".
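One way to check this on a raw JSON document is to walk its escape sequences and flag any surrogate escape that is not part of a high+low pair. The sketch below is only an illustration: `find_unpaired_surrogates` is a hypothetical helper, and it assumes the input is otherwise syntactically valid JSON (so backslashes only occur inside strings). Some parsers accept unpaired surrogates without complaint (Python's own `json.loads` is one of them), which is why the scan is done on the text rather than on the parsed result.

```python
import re
from typing import List, Tuple

# Matches one escape at a time, so "\\\\uD834" (an escaped backslash
# followed by plain text "uD834") is not mistaken for a \u escape.
_ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|.)", re.DOTALL)


def find_unpaired_surrogates(json_text: str) -> List[Tuple[int, str]]:
    """Return (offset, escape) for every surrogate escape lacking a partner."""
    problems = []
    pending = None  # (offset, escape, end) of a high surrogate awaiting its low half
    for m in _ESCAPE.finditer(json_text):
        if m.group(1) is None:
            continue  # a non-\u escape such as \n or \\
        unit = int(m.group(1), 16)
        is_high = 0xD800 <= unit <= 0xDBFF
        is_low = 0xDC00 <= unit <= 0xDFFF
        if pending is not None:
            if is_low and m.start() == pending[2]:
                pending = None  # well-formed pair: low half follows immediately
                continue
            problems.append(pending[:2])  # high surrogate was never completed
            pending = None
        if is_high:
            pending = (m.start(), m.group(0), m.end())
        elif is_low:
            problems.append((m.start(), m.group(0)))  # low half with no high half
    if pending is not None:
        problems.append(pending[:2])
    return problems


print(find_unpaired_surrogates(r'"\uD834\uDD1E"'))  # well-formed pair
print(find_unpaired_surrogates(r'"\uD834 oops"'))   # lone high surrogate
```

The first call reports nothing; the second reports the lone `\uD834` at offset 1.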
