Since C++11, it has been possible to create User Defined Literals. As expected, it's possible to return complex structs from such literals. However, when trying to use such operators as 123_foo.bar():

struct foo {
    int n;
    int bar() const { return n; }
};

constexpr foo operator ""_foo(unsigned long long test)
{
    return foo{ static_cast<int>(test) };
}

int main() {
    return 123_foo.bar();
}

GCC and Clang reject it, saying they can't find an operator""_foo.bar. MSVC accepts it. If I instead write 123_foo .bar(), all three compilers accept it.

Who is right here? Is 123_foo.bar() ever valid?


Some extra information:


I'm inclined to believe that this is a GCC and Clang bug, as . is not part of a valid identifier.

W.F.
Justin
  • 2
    If I were to guess, maximal munch applies for some reason – Passer By Mar 01 '18 at 07:25
  • 2
    @PasserBy But the `.` is *after* the UDL identifier, so I can't see how maximal munch applies here. – Justin Mar 01 '18 at 07:26
  • 1
    Interesting. It seems `_foo.bar` can be a valid name of UDL, as per GCC and Clang! – Nawaz Mar 01 '18 at 07:28
  • @Nawaz Hmm. Doesn't that mean that this has to be a GCC and Clang bug, as [the standard's grammar](http://eel.is/c++draft/lex.ext#nt:ud-suffix) shows all the udl literals as ending in `ud-suffix`, which is just an `identifier`? – Justin Mar 01 '18 at 07:30
  • Another guess: a token starting with a number will include `.`. Can't dive into grammar rules right now. I think there's a dupe somewhere – Passer By Mar 01 '18 at 07:30
  • @Justin: I think so: it seems to be GCC and Clang bug, at least the error message is misleading. – Nawaz Mar 01 '18 at 07:32
  • 1
    @PasserBy Sounds right to me. By the same logic I expect `1_xe+2` to be invalid even if `operator""_xe`'s return type provides an `operator+` taking an `int`: `e+` is also allowed in numbers. –  Mar 01 '18 at 07:58
  • 1
    @hvd Yeah that falls under [maximal munch](http://en.cppreference.com/w/cpp/language/user_literal#Notes) as mentioned in cppreference. – Justin Mar 01 '18 at 07:59
  • 1
    @Justin How is it that for `1_foo.bar` you didn't see it, but for `1_foe+bar` you do? :) It's exactly the same logic for both. –  Mar 01 '18 at 08:03
  • @hvd I misread the cppreference excerpt. – Justin Mar 01 '18 at 08:04
  • @hvd I'd say the point sign is an element of an alphabet of the floating point number literal while plus character isn't... – W.F. Mar 01 '18 at 08:08
  • @W.F. What about `1e+5`? – Justin Mar 01 '18 at 08:09
  • 1
    @Justin good point, though after plus character you'll have to find number while e.g. `1.` it is valid floating point literal – W.F. Mar 01 '18 at 08:09
  • Just recently filed a bug report with MS https://developercommunity.visualstudio.com/content/problem/203264/dot-operator-immediately-following-user-defined-in.html?childToView=203252#comment-203252 Low expectations though... – Killzone Kid Mar 01 '18 at 09:54

1 Answer

TL;DR: Clang and GCC are correct. You can't write a . immediately after a user-defined integer or floating literal; MSVC accepting it is an MSVC bug.
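As a quick sketch of what does work, reusing the foo and operator ""_foo from the question: a space, as the question already notes, or a pair of parentheses ends the literal's token before the member access. The names with_space and with_parens are only for illustration, and bar() is marked constexpr here (unlike in the question) so the results can be checked at compile time.

```cpp
struct foo {
    int n;
    constexpr int bar() const { return n; }
};

constexpr foo operator ""_foo(unsigned long long test)
{
    return foo{ static_cast<int>(test) };
}

// A space ends the literal's token before the dot.
constexpr int with_space = 123_foo .bar();

// Parentheses do the same: ')' can never be part of a number token.
constexpr int with_parens = (123_foo).bar();

static_assert(with_space == 123, "space form calls bar() on the literal");
static_assert(with_parens == 123, "parenthesized form calls bar() on the literal");
```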

When a program is compiled, it goes through 9 phases of translation in order. The key thing to note here is that lexing (separating) the source code into tokens is done before its semantic meaning is taken into consideration.

In this phase (phase 3), maximal munch is in effect: each token is taken to be the longest sequence of characters that can form a valid preprocessing token. For example, x+++++y is lexed as x ++ ++ + y instead of x + ++ ++ y, even though the former isn't semantically valid.
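A smaller case of the same rule that does compile is x+++y. The sketch below (munch_demo and check_munch are made-up names for illustration) shows that it is read as (x++) + y rather than x + (++y).

```cpp
#include <cassert>

// Evaluates x+++y and reports the value of x afterwards through x_after.
int munch_demo(int x, int y, int& x_after)
{
    int r = x+++y;   // lexed as x ++ + y, i.e. (x++) + y
    x_after = x;     // x was incremented; y was left alone
    return r;
}

void check_munch()
{
    int x_after = 0;
    assert(munch_demo(1, 2, x_after) == 3);  // 1 + 2, using the old value of x
    assert(x_after == 2);                    // the post-increment applied to x
}
```

Had the lexer chosen x + ++y instead, the result would have been 4 and x would have stayed at 1.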

The question is then what the longest syntactically valid sequence for 123_foo.bar is. Following the production rules for a preprocessing number, the derivation is

pp-number → pp-number identifier-nondigit → ... → pp-number identifier-nondigit³ →
pp-number nondigit³ → pp-number . nondigit³ → ... → pp-number nondigit⁴ . nondigit³ →
pp-number digit nondigit⁴ . nondigit³ → ... → pp-number digit² nondigit⁴ . nondigit³ →
digit³ nondigit⁴ . nondigit³

This resolves to 123_foo.bar, so the whole of 123_foo.bar is a single preprocessing number, which is why it shows up in the error message.
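To make the derivation concrete, here is a rough sketch of a scanner that greedily matches a pp-number. It is a simplification and not how any real compiler is written: scan_pp_number and check_scan are hypothetical names, and digit separators and universal-character-names are deliberately left out.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the longest prefix of src that is a pp-number
// under a simplified grammar (no digit separators, no universal-character-names).
std::string scan_pp_number(const std::string& src)
{
    auto is_digit = [](char c) {
        return std::isdigit(static_cast<unsigned char>(c)) != 0;
    };
    auto is_nondigit = [](char c) {
        return c == '_' || std::isalpha(static_cast<unsigned char>(c)) != 0;
    };

    std::size_t i = 0;
    // A pp-number starts with a digit, or with '.' followed by a digit.
    if (!src.empty() && is_digit(src[0])) {
        i = 1;
    } else if (src.size() >= 2 && src[0] == '.' && is_digit(src[1])) {
        i = 2;
    } else {
        return "";
    }

    while (i < src.size()) {
        const char c = src[i];
        const bool exponent = c == 'e' || c == 'E' || c == 'p' || c == 'P';
        if (exponent && i + 1 < src.size() &&
            (src[i + 1] == '+' || src[i + 1] == '-')) {
            i += 2;  // pp-number e sign (and E, p, P)
        } else if (is_digit(c) || is_nondigit(c) || c == '.') {
            i += 1;  // pp-number digit, pp-number identifier-nondigit, pp-number .
        } else {
            break;   // anything else ends the token
        }
    }
    return src.substr(0, i);
}

void check_scan()
{
    assert(scan_pp_number("123_foo.bar()") == "123_foo.bar");
    assert(scan_pp_number("123_foo .bar()") == "123_foo");
}
```

With the space, the token ends at 123_foo, which matches the observation that 123_foo .bar() is accepted by all three compilers.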

Passer By
  • I find it misleading that this doesn't appear in [[lex.ext\]](http://eel.is/c++draft/lex.ext) – Justin Mar 01 '18 at 08:12
  • @Justin: Plenty of reasons that I can imagine. Imagine currency literals; they're commonly abbreviated by 3 letter codes. – MSalters Mar 01 '18 at 08:15
  • 4
    @Justin The problem actually came before your linked paragraph. A program is first lexed into preprocessing tokens without any notion of literals. Matching a token against literals comes later, and is the content of your link. – Passer By Mar 01 '18 at 08:23
  • Please explain why the conversion from a pp token to a token that is required, and the subsequent parsing, does not produce an error at that stage. The grammar for an ud-suffix clearly states that it is an identifier, and any token that is not a valid identifier should result in a syntax error, no? – Griwes Mar 01 '18 at 12:06
  • To elaborate: I think this is the same case as with `1..2f`, which is a valid pp-token, and the compilers we are interested in correctly give an error of ".2f is not a valid suffix"; same way, both GCC and Clang should error out as "_foo.bar is not a valid ud suffix". – Griwes Mar 01 '18 at 12:09
  • @Griwes First `1..2f` as a whole is considered a preprocessor token. Then it is converted to a token. Then it is considered as a literal, by which point the compiler realized `1.` is the longest valid floating literal, with `.2f` as a suffix, which produces the error. From the error message, I can hypothesize that the compilers decided to let name lookup to produce the error, since identifier rules necessitates such an error, as opposed to check once again whether the suffix is valid. – Passer By Mar 01 '18 at 12:29
  • 1
    Great answer, however MSVC is free to accept an otherwise ill-formed program as a [language extension](http://eel.is/c++draft/intro.compliance) [intro.compliance/8]. It is theoretically required to emit a diagnostic, however, to the best of my knowledge, all major implementations violate this rule and accept various extensions without emitting a diagnostic (e.g. `__attribute__` for GCC/Clang). Command line options (`/Za` for MSVC) may be used to disable language extensions, with limited success. – Arne Vogel Mar 01 '18 at 17:25
  • Is there any reason why `pp-number` goes right to left? If it's reversed so that it doesn't have left recursion, I think it's possible to modify the grammar a bit to do things like not allow a `.` after an identifier. – Justin Mar 01 '18 at 18:15
  • @Justin I'm no compiler writer, but I believe the rules are such so that lexing in phase 3 is incredibly easy – Passer By Mar 02 '18 at 00:57