The standard seems to have two different responses to char16_t literals containing a character that can't be represented by char16_t. First of all, if the code point value can't be represented in 16 bits (i.e. it is not in the basic multilingual plane (BMP)), then the program is ill-formed (§2.14.3/2):

The value of a char16_t literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point is representable with a single 16-bit code unit. (That is, provided it is a basic multi-lingual plane code point.) If the value is not representable within 16 bits, the program is ill-formed.

Since \U0001ABCD is a single c-char¹ but is not in the BMP, a program containing it in a char16_t character literal is ill-formed.
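To make the BMP condition concrete, here is a minimal sketch. The helper names `fits_in_one_utf16_unit` and `code_units_for_U1ABCD` are made up for illustration and do not come from the standard. The check only compares against 0xFFFF and ignores the surrogate range. The ill-formed literal is left commented out, and the string-literal case assumes the usual UTF-16 encoding of `u"..."` literals.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: true if the code point fits in a single 16-bit
// code unit, i.e. it is at most U+FFFF. Surrogates are not special-cased.
constexpr bool fits_in_one_utf16_unit(char32_t cp) {
    return cp <= 0xFFFF;
}

static_assert(fits_in_one_utf16_unit(U'\u0B95'), "U+0B95 is in the BMP");
static_assert(!fits_in_one_utf16_unit(U'\U0001ABCD'), "U+1ABCD is outside the BMP");

// For a BMP character, the char16_t literal's value is its code point.
static_assert(u'\u0B95' == 0x0B95, "value equals the code point");

// char16_t bad = u'\U0001ABCD';  // ill-formed per 2.14.3/2, so left commented out

// Hypothetical helper: number of char16_t code units (excluding the
// terminator) that the string literal u"\U0001ABCD" occupies.
std::size_t code_units_for_U1ABCD() {
    const char16_t s[] = u"\U0001ABCD";
    return sizeof(s) / sizeof(s[0]) - 1;
}
```

In a string literal the same universal-character-name is fine, because it can be encoded as a surrogate pair (two code units).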

Okay, but later on in the same chapter, it says that if the value falls outside the implementation-defined range of char16_t then the literal has an implementation-defined value (§2.14.3/4):

The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for [...] char16_t (for literals prefixed by ’u’) [...]

Since the implementation-defined range for char16_t must be at least 16 bits (to be able to store the entire BMP), we already know that the program is ill-formed for a value that falls outside that range. Why does the standard bother giving it an implementation-defined value?
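The "at least 16 bits" claim can be checked at compile time. This is only a sketch, and `in_char16_range` is a hypothetical helper name. It assumes nothing beyond `char16_t` having the same range as `std::uint_least16_t`. Whether 0x1ABCD lands inside or outside that range depends on the implementation, so the sketch does not assert it either way.

```cpp
#include <cstdint>
#include <limits>

// char16_t must be able to hold every 16-bit code unit.
static_assert(std::numeric_limits<char16_t>::max() >= 0xFFFF,
              "char16_t covers at least 16 bits");

// Hypothetical helper: does the value fall within the
// implementation-defined range of char16_t?
constexpr bool in_char16_range(std::uint_least32_t v) {
    return v <= static_cast<std::uint_least32_t>(
                    std::numeric_limits<char16_t>::max());
}

static_assert(in_char16_range(0xFFFF), "the whole BMP always fits");
```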

¹ The production rules are c-char -> universal-character-name -> \U hex-quad hex-quad

Joseph Mansfield
  • Interestingly, gcc 4.7 compiles it fine. Just the warning: "character constant too long for its type [enabled by default]" – Joseph Mansfield Nov 25 '12 at 20:06
  • So to be clear, `char16_t c = u'\U0001ABCD';` is ill-formed, but `char16_t s[] = u"\U0001ABCD";` is not, agreed? – Kerrek SB Nov 25 '12 at 20:10
  • @KerrekSB Agreed. Specifically, the *character* literals are ill-formed. – Joseph Mansfield Nov 25 '12 at 20:11
  • **See also:** http://stackoverflow.com/questions/13547368/is-u0b95-a-multicharacter-literal – Lightness Races in Orbit Nov 25 '12 at 21:11
  • gcc's behavior in the past has been strange. Due to the requirement that UCNs behave the same as literal characters, they made UCNs behave the same as UTF-8 sequences. And the behavior of UTF-8 sequences hadn't been deliberately designed; it just fell out of the implementation. http://ideone.com/9cg69P. IMHO clang's behavior makes much more sense (although maybe gcc 4.7 has fixed all the previous issues). – bames53 Nov 25 '12 at 21:12
  • @LightnessRacesinOrbit Yeah, that's my other question. – Joseph Mansfield Nov 25 '12 at 21:15

1 Answer

The program is ill-formed as per 2.14.3/2, which means the error must be diagnosed. There's no need to analyze further, because implementations are not required to finish compiling or to produce an executable. The literal may be considered to still have a value, but it hardly matters.

(Although implementations are permitted to compile and execute ill-formed programs. So I suppose in that case the fact that the character literal is still specified to have a value would matter.)

bames53
  • Are you sure? I used the fact that it's not to answer [my own question from yesterday](http://stackoverflow.com/q/13547368/150634). I figured it was just tucked into a silly paragraph. – Joseph Mansfield Nov 25 '12 at 20:11