
The questions below concern character sets (C11, 5.2.1 Character sets) and the phase 1 mapping (C11, 5.1.1.2 Translation phases, 1).

The list:

  1. Can the source character set, as an extension, include control characters other than those representing horizontal tab, vertical tab, and form feed? If so, must a diagnostic be produced when such control characters are used in, e.g., a string literal?

    Example: GCC/LLVM/MSVC accept many control characters in a string literal without issuing a diagnostic, and they keep those control characters in the string literal after the mapping in translation phase 1 is done. (This means that GCC/LLVM/MSVC support these control characters in the source character set.) Is it OK that no diagnostic is produced?

Demo:

```sh
# GCC
# test \x00
$ echo "char x[] = \"xxx\"; int s = sizeof x;" > t999.c ;\
printf '\x00' | dd of=t999.c bs=1 seek=13 count=1 conv=notrunc ;\
gcc t999.c -c -std=c11 -pedantic -Wall -Wextra -S ;\
grep 's:' t999.S -A1
t999.c:1:12: warning: null character(s) preserved in literal
    1 | char x[] = "x x"; int s = sizeof x;
      |            ^
s:
        .long   4
# here we see that a diagnostic is produced, sizeof x is 4

# test \x01
$ echo "char x[] = \"xxx\"; int s = sizeof x;" > t999.c ;\
printf '\x01' | dd of=t999.c bs=1 seek=13 count=1 conv=notrunc ;\
gcc t999.c -c -std=c11 -pedantic -Wall -Wextra -S ;\
grep 's:' t999.S -A1
s:
        .long   4
# here we see that no diagnostic is produced, sizeof x is 4

# MSVC
# test \x00
# see below

# test \x01
$ echo "char x[] = \"xxx\"; int s = sizeof x;" > t999.c ;\
printf '\x01' | dd of=t999.c bs=1 seek=13 count=1 conv=notrunc ;\
cl t999.c /c /std:c11 /FA /nologo ;\
grep -P '^s' t999.asm
s       DD      04H
# here we see that no diagnostic is produced, sizeof x is 4
```

  2. C11, 5.1.1.2 Translation phases, 1:

Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary.

A simple question: is "mapping to nothing" still a mapping, e.g. X => <nothing>? Or is it not a "mapping" at all, but rather "skipping" (or "removal")? Example: in "x<null>y" (bytes 22 78 00 79 22), MSVC skips/removes the null character without producing a diagnostic, so sizeof yields 3 instead of 4. Is that OK?

Demo:

```sh
# MSVC
# test \x00
$ echo "char x[] = \"xxx\"; int s = sizeof x;" > t999.c ;\
printf '\x00' | dd of=t999.c bs=1 seek=13 count=1 conv=notrunc ;\
cl t999.c /c /std:c11 /FA /nologo ;\
grep -P '^s' t999.asm
s       DD      03H
# here we see that no diagnostic is produced, sizeof x is 3
```
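
For comparison, here is a minimal sketch of the same arrays with the middle character written as an escape sequence instead of a raw byte. The names `with_soh` and `with_nul` are illustrative and not part of the demo above. Because no control character appears in the source file, the phase 1 mapping has nothing implementation-specific to do here, and the array size follows directly from 6.4.5. This does not settle whether the raw-byte form is permitted; it only shows the size that the escaped form has.

```c
#include <assert.h>  /* static_assert (C11) */

/* Illustrative names (assumptions, not from the demo).
   The middle character is an escape sequence, not a raw control byte. */
const char with_soh[] = "x\x01x";  /* 'x', 0x01, 'x', terminating null */
const char with_nul[] = "x\0x";    /* 'x', 0x00, 'x', terminating null */

/* Three characters from the literal plus the terminating null. */
static_assert(sizeof with_soh == 4, "escaped SOH counts as one character");
static_assert(sizeof with_nul == 4, "escaped null counts as one character");
```

If this compiles as C11, both assertions held at translation time, so the 3 reported by MSVC is specific to the raw null byte in the source file.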
pmor
  • Why would you need a diagnostic for a valid character? 6.4.5/1 lets any source character other than backslash, double quote and newline be present in a string without being escaped, and the restriction in 5.2.1/3 doesn't apply to string literals and character constants. – rici Apr 11 '22 at 23:09
  • 6.4.5/1 says "any member of the _source character set_". Hence, the question is whether e.g. `0x01` (Start of Heading, SOH) is allowed to be part of the source character set. SOH is not part of the required source character set (C11, 5.2.1 Character sets, 3). Can SOH be supported as an extension? If yes, does supporting this extension require (or imply) producing a diagnostic, e.g. `SOH character(s) preserved in literal`? – pmor Apr 12 '22 at 15:56
  • The *source character set* consists of the *basic source character set* and "zero or more locale-specific characters" (5.2.1/1). The only requirements placed on the extended characters are that (1) they are not in the basic character set (5.2.1/1) and (2) that if they are multibyte characters, none of the bytes are 0 (5.2.1.2/1). There are no other restrictions, and the semantics of these characters is implementation-defined. What do you think would preclude SOH (or any other character) from being in the source character set? – rici Apr 12 '22 at 17:31
  • Yes, if "there are no other restrictions", then SOH (`0x01`) can be classified as a member of the extended character set. Thanks! – pmor Apr 12 '22 at 18:29
  • What confuses me is the "superset": "NOTE The extended character set is a _superset_ of the basic character set", while later it says "... a set of zero or more locale-specific members (which are _not members of the basic character set_) called extended characters". Mathematically, a set `A` is a superset of another set `B` if all elements of `B` are elements of `A`. – pmor Apr 12 '22 at 18:38
  • There's a notorious ambiguity in English noun phrases, one famous example of which is the phrase "the little white boys' school". How do we interpret that? Does it refer to the scale of the building or its students? Is the school racist or boringly painted? In the case of C character sets, there are extended characters, distinct from basic characters, but the extended character set is an extended set of characters, not a set of extended characters. That's made explicit in 5.2.1/1. So there's no contradiction with mathematics. – rici Apr 12 '22 at 21:33
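
The distinction discussed in the comments can be sketched in code. The helper below is hypothetical (the name `is_basic_source_char` is an assumption, not anything from the standard or a compiler). It tests whether a character is one of the members of the basic source character set listed in C11 5.2.1/3: the letters, digits, 29 graphic characters, space, horizontal tab, vertical tab, and form feed. The end-of-line indicator is handled separately by the standard and is deliberately left out of this sketch. Any character for which the helper returns false, such as SOH, can only be in the source character set as an extended character.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if c is a member of the basic source
   character set as enumerated in C11 5.2.1/3 (new-line not included). */
bool is_basic_source_char(int c)
{
    static const char members[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789"
        "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~"  /* the 29 graphic characters */
        " \t\v\f";                          /* space, HT, VT, FF */

    /* Reject 0 explicitly (strchr would match the terminating null)
       and anything outside the range of a single byte. */
    if (c <= 0 || c > UCHAR_MAX)
        return false;
    return strchr(members, c) != NULL;
}
```

For instance, `is_basic_source_char('\t')` is true while `is_basic_source_char(0x01)` is false, which matches the point above that SOH is not in the required set but is not thereby excluded from the extended one.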
