The standard seems to have two different responses to `char16_t` literals containing a character that can't be represented by `char16_t`. First, if the code point value can't be represented in 16 bits (i.e. it is not in the basic multilingual plane (BMP)), then the program is ill-formed (§2.14.3/2):
> The value of a `char16_t` literal containing a single *c-char* is equal to its ISO 10646 code point value, provided that the code point is representable with a single 16-bit code unit. (That is, provided it is a basic multi-lingual plane code point.) If the value is not representable within 16 bits, the program is ill-formed.
Since `\U0001ABCD` is a single *c-char*¹ but is not in the BMP, a program containing it is ill-formed.
Okay, but later on in the same chapter, it says that if the value falls outside the implementation-defined range of `char16_t`, then the literal has an implementation-defined value (§2.14.3/4):
> The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for [...] `char16_t` (for literals prefixed by `u`) [...]
Since the implementation-defined range for `char16_t` must cover at least 16 bits (to be able to store the entire BMP), we already know that the program is ill-formed for any value that falls outside that range. Why does the standard bother giving such a literal an implementation-defined value?
¹ The production rules are *c-char* → *universal-character-name* → `\U` *hex-quad* *hex-quad*.