23

Even though I am an old-timer, I fear I no longer have a complete grasp of how constants are parsed in C. The second of the following one-liners fails to compile:

int main( void ) { return (0xe +2); }
int main( void ) { return (0xe+2); }

$ gcc -s weird.c

weird.c: In function ‘main’:
weird.c:1:28: error: invalid suffix "+2" on integer constant
int main( void ) { return (0xe+2); }
                           ^

The reason for the compilation failure is probably that 0xe+2 is parsed as a hexadecimal floating point constant as per C11 standard clause 6.4.4.2. My question is whether a convention exists for writing simple additions of hexadecimal and decimal numbers in C; I do not like having to rely on white space in parsing.

This was with gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9). Stopping compilation after preprocessing (-E) shows that the compilation failure happens in gcc, not cpp.

Lundin
Baard
  • 1
    Only guessing, so not as an answer: You might have to stay away from anything looking like a float. I think your code piece fits in this http://en.cppreference.com/w/cpp/language/floating_literal – Yunnosch Apr 11 '18 at 06:21
  • MS documentation, but I think it should be generally applicable to gcc and it has lots of examples: https://msdn.microsoft.com/en-us/library/w9bk1wcy.aspx – zzxyz Apr 11 '18 at 06:26
  • Here's an answer which mentions this odd corner of the language: https://stackoverflow.com/a/41701152/1566221 – rici Apr 11 '18 at 06:47
  • Apparently this is the reason for the error https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3885. See C11 6.4.8. I don't know enough of these obscure things to write an answer though. – Lundin Apr 11 '18 at 07:10
  • @Lundin, I thought translation phase 7 was covered by the preprocessor? Bug or not, this was not my question. I only try to find a good way to write such a sum as 0x4e+33. It seems a bit over the top to write (0x4e)+33, so probably rici is right, whitespace *is* the convention. – Baard Apr 11 '18 at 07:17
  • You already rely on whitespace to get around things like `*p = 2; x = x/*p;`. This shouldn't be any different. – Hong Ooi Apr 11 '18 at 09:20
  • @Baard Fine, I posted an answer that covers the actual question, rather than the reason why. – Lundin Apr 11 '18 at 09:47
  • Just another suggestion, because it seems nobody else has mentioned it - you could rewrite it as `2+0xe` to avoid the ambiguity. Whitespace would still be good, but it's not mandatory in this case... – twalberg Apr 11 '18 at 14:17
  • @Baard Translation phases 1 through 4 are definitely part of the preprocessor, and translation phase 7 is definitely part of the compiler proper. Phases 5 and 6 could be either, but _historically_, in implementations where the preprocessor and the compiler proper were two separate programs, they were part of the compiler proper. Phase 8 is the linker. – zwol Apr 11 '18 at 17:14
  • @Baard Most modern C compilers implement phases 1 through 7 all in one program, but you can still usefully draw a line right after phase 4 because that's the phase in which preprocessing directives are executed. – zwol Apr 11 '18 at 17:15

2 Answers

20

Because GCC reads 0xe+2 as a single number, as if e+2 were the exponent of a floating point constant, whereas you meant an addition of two integers.

According to cppreference:

Due to maximal munch, hexadecimal integer constants ending in e and E, when followed by the operators + or -, must be separated from the operator with whitespace or parentheses in the source:

int x = 0xE+2;   // error
int y = 0xa+2;   // OK
int z = 0xE +2;  // OK
int q = (0xE)+2; // OK
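As a rough illustration of the spellings discussed on this page, here is a small sketch. The function names are made up for the example; each one returns 0xe plus 2 written in a way that tokenizes as intended, including the reordered form suggested in the comments under the question.

```c
/* Hypothetical helpers: each returns 0xe + 2 (that is, 16) using one
 * spelling that does not let "e+" glue the operands into one token. */

/* White space after the hex constant ends the number. */
int sum_with_space(void)
{
    return 0xe +2;
}

/* A closing parenthesis also ends the number. */
int sum_with_parens(void)
{
    return (0xe)+2;
}

/* With the decimal operand first, "2+" cannot be read as an exponent. */
int sum_reordered(void)
{
    return 2+0xe;
}

/* A hex constant that does not end in e or E needs no separator. */
int sum_other_digit(void)
{
    return 0xa+2;
}
```

All of these should be accepted by a conforming compiler; only the form with e or E directly followed by + or - is rejected.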
msc
  • There seems to be a bug in the gcc message "error: invalid suffix on integer constant". If 0xe was an integer constant then gcc shouldn't complain. – Lundin Apr 11 '18 at 06:30
  • I understand *why* the compilation fails, this is clearly stated in ISO/IEC 9899:201x clause 6.4.4.2. My question is if there is a programming convention to use to get around reliance on white-space in a sum of constants with mixed types. – Baard Apr 11 '18 at 06:44
  • 6
    @baard: whitespace *is* the convention. – rici Apr 11 '18 at 06:47
  • 1
    @Baard It is actually impossible to understand why compilation fails by reading chapter 6.4.4 alone. – Lundin Apr 11 '18 at 06:51
  • 1
    @Baard What did whitespace do to you? The only disadvantage I can see of using whitespace is if you are using an IDE with buggy formatting, and you configure it to remove spaces around operators and format the file. A good IDE will not remove a space where it is necessary, as in this case; a bad one will just apply the rule and cause an error (but the error is immediately identified and fixed easily so it's not a big deal...). However, parentheses can fail for the same reason: when asked to remove unnecessary parentheses, if it does not take this special case into account, you are screwed again. – Giacomo Alzetta Apr 11 '18 at 09:31
  • Why would `0x...e+N` be interpreted as a floating-point number? Hexfloat literals (since C++17 or GNU++98) use `p` or `P` to indicate beginning of exponent, so this explanation doesn't seem complete. – Ruslan Apr 11 '18 at 10:04
  • 1
    @Lundin "0xe" is a valid integer constant, with no suffix. "0xeL" is a valid integer constant with a valid suffix, "L". It is not _that_ much of a stretch to describe "0xe+2" as the integer constant "0xe" again, but with the _invalid_ suffix "+2". I agree that the error message could be better, though. – zwol Apr 11 '18 at 13:20
  • @zwol After digging deeper into this, the issue seems to be that gcc doesn't know what it is parsing at that point. It is trying to build up a "token" such as a constant, which in turn consists of "preprocessing-token". The pre-processor fails to build up the "preprocessing-token" called "pp-number" and then gcc gives this error message when the pp-number can't be translated to a valid token. This happens at an earlier point, before it can deduce what kind of constant it is dealing with. So I suppose it should rather be saying something like "invalid suffix +2 on constant". – Lundin Apr 11 '18 at 13:30
  • @Lundin That's close but not quite correct. `0xe+2` is a _valid_ "pp-number" "preprocessing token", but then in translation phase 7, each "preprocessing token" is required to be converted to a _single_ "token", and that process fails. (I'm not sure where this requirement is stated besides the list of translation phases in 5.1.1.2, but 5.1.1.2 is normative, so.) – zwol Apr 11 '18 at 13:47
  • 1
    Yes, probably a "correcter" version of the error message would be "preprocessing number is not a valid integer or floating point constant". I am not sure that this would be easier to understand, though. – Jens Gustedt Apr 11 '18 at 14:46
13

My question is whether a convention exists to write simple additions of hexadecimal and decimal numbers in C

The convention is to use spaces. This is actually backed by C11 6.4 §3:

Preprocessing tokens can be separated by white space; this consists of comments (described later), or white-space characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both.

A plain space is the one commonly used.

Similar exotic issues exist here and there in the language, some examples:

  • ---a must be rewritten as - --a.
  • a+++++b must be rewritten as a++ + ++b.
  • a /// comment
    b;
    must be rewritten as
    a / // comment
    b;
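A minimal sketch of the rewritten forms from the list above, together with the val / *ptr case raised in the comments. The function names are invented for the example, and the // comment assumes C99 or later.

```c
/* Hypothetical helpers showing where white space is needed to split
 * tokens that maximal munch would otherwise merge. */

/* "- --a": unary minus applied to the pre-decremented value. */
int negate_after_decrement(int a)
{
    return - --a;
}

/* "a++ + ++b": post-increment, binary plus, pre-increment. */
int post_plus_pre(int a, int b)
{
    return a++ + ++b;
}

/* "a / // comment": the space keeps '/' a division operator. */
int divide_across_comment(int a, int b)
{
    return a / // this comment starts after the division operator
           b;
}

/* "val / *ptr": without the space, the two characters open a comment. */
int divide_by_pointee(int val, const int *ptr)
{
    return val / *ptr;
}
```

Removing the marked spaces changes the token sequence in every case, so the code either stops compiling or silently means something else.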

And so on. The culprit in all of these cases is the token parser which follows the so-called "maximal munch rule", C11 6.4 §4:

If the input stream has been parsed into preprocessing tokens up to a given character, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token.

In this specific case, the pre-processor does not make any distinction between floating point constants and integer constants, when it builds up a pre-processing token called pp-number, defined in C11 6.4.8:

pp-number e sign
pp-number E sign
pp-number p sign
pp-number P sign
pp-number .

A preprocessing number begins with a digit optionally preceded by a period (.) and may be followed by valid identifier characters and the character sequences e+, e-, E+, E-, p+, p-, P+, or P-.

Here, a pp-number apparently does not have to be a valid floating point constant, or a valid integer constant, as far as the pre-processor is concerned.
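To make the effect of that grammar concrete, here is a rough sketch of a scanner that applies the pp-number rule greedily. It is not how gcc is implemented; pp_number_length is a hypothetical helper written for this illustration, and it ignores universal character names.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns the number of characters at the start of s
 * that form one pp-number under the C11 6.4.8 grammar, or 0 if no
 * pp-number starts there. */
size_t pp_number_length(const char *s)
{
    size_t i;

    /* A pp-number starts with a digit, or with '.' followed by a digit. */
    if (isdigit((unsigned char)s[0])) {
        i = 1;
    } else if (s[0] == '.' && isdigit((unsigned char)s[1])) {
        i = 2;
    } else {
        return 0;
    }

    for (;;) {
        const char c = s[i];

        if ((c == 'e' || c == 'E' || c == 'p' || c == 'P') &&
            (s[i + 1] == '+' || s[i + 1] == '-')) {
            /* e, E, p or P followed by a sign is consumed as one unit. */
            i += 2;
        } else if (isalnum((unsigned char)c) || c == '_' || c == '.') {
            /* Digits, identifier characters and '.' extend the number. */
            i += 1;
        } else {
            break;
        }
    }
    return i;
}
```

With this sketch, "0xe+2" is consumed whole (length 5), while "0xf+2" stops after "0xf" (length 3), which matches the observation that only hex constants ending in e or E are affected.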


( As a side note, a similar convention also exists when terminating hexadecimal escape sequences inside strings. If, for example, I want to print the string "ABBA" on a new line, then I can't write

puts("\xD\xAABBA"); (CR+LF+string)

because the letters that follow \xA are all valid hex digits and would be interpreted as part of the hex escape sequence. Instead I have to end the string literal right after the escape sequence and then rely on the concatenation of adjacent string literals: puts("\xD\xA" "ABBA"). The purpose is the same, to tell the pre-processor how to parse the code. )
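A small sketch of the string case, assuming an ASCII execution character set (so that 0xD is carriage return and 0xA is line feed). The function name crlf_abba is made up for the example.

```c
#include <string.h>

/* Hypothetical helper: returns CR, LF and then the text "ABBA".
 * Splitting the literal ends the \xA escape before the letters,
 * which would otherwise be read as further hex digits. The two
 * adjacent literals are concatenated into one string. */
const char *crlf_abba(void)
{
    return "\xD\xA" "ABBA";
}
```

The resulting string has six characters: the two control characters followed by the four letters.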

Lundin
  • 1
    This doesn't seem to explain why `0xE+3` could constitute a preprocessing token, but `0xF+3` couldn't. – Ruslan Apr 11 '18 at 10:22
  • 1
    @Ruslan No, because this only answers the question. The reason why has to do with the maximal munch rule and the rather blunt _pp-number_ syntax in 6.4.8, where the standard apparently does not make any difference between integer constants and floating point constants. The longest valid token sequence here is _pp-number_ **e** _sign_. – Lundin Apr 11 '18 at 10:58
  • I added that part to the answer too. – Lundin Apr 11 '18 at 11:03
  • 1
    Another example of “reasonable” code that requires spaces: `val/*ptr`. These things can also be fixed with parentheses if one does not like spaces. `-(--a)` may be considered more readable than `- --a`, even though they're both kind of awful. – wrtlprnft Apr 11 '18 at 11:44
  • @wrtlprnft There's a lot of ways to make the parser happy, but that doesn't mean they are convention. You could also write `val/&ptr[0]` or `val/+*ptr;` or `-+--a` but that's obfuscation. Using spaces is the most readable way. – Lundin Apr 11 '18 at 13:12
  • Obviously, this is all a matter of preference. Do you have any sources why whitespace is *the* convention? Personally, I would consider putting a space between a unary operator and its operand an obfuscation because that makes it look like a binary operator. – wrtlprnft Apr 11 '18 at 13:24
  • @wrtlprnft As quoted in this answer, the standard itself says that pre-processing tokens are separated by white space (or comments). Meaning that white space is the only universal way that is guaranteed to always work. Using various operators as in the examples in the comments above is not a universal solution, since those may form either one or several pre-processing tokens. – Lundin Apr 11 '18 at 13:39
  • Is there any practical advantage to having the preprocessor regard 1E+3 as a single token, versus having the compiler regard 1E as a "number expecting signed exponent" token? Once a digit is encountered, scanning until the next thing that isn't an alphanumeric character or period seems easier than having to exclude the two-character combinations [EePp][+-] from the exit conditions. – supercat Apr 12 '18 at 20:58
  • As an aside, I'm extremely curious what this `p` syntax is. I'm used to seeing `e` as a power of ten multiplier, but have never seen `p` before. Dug up a copy of the document but it said nothing beyond the quoted text (mind you, there were 500 pages, so I was only content to CTRL + F). – Kat Apr 16 '18 at 19:48
  • @Kat The `p` syntax is for hexadecimal floating point constants, e.g. `0x1.921fb54442d18p+1` is a `double`-precision approximation to π. The standard doesn't explain why `p` is used, but it can't be `e` because `e` is a valid hexadecimal digit. – zwol May 02 '18 at 15:55