ANTLR: Why is this grammar rule for tuples not LL(1)?

Question

I have the following grammar rules defined to cover tuples of the form: (a), (a,), (a,b), (a,b,) and so on. However, antlr3 gives the warning:

"Decision can match input such as "COMMA" using multiple alternatives: 1, 2"

I believe this means that my grammar is not LL(1). This caught me by surprise as, based on my extremely limited understanding of this topic, the parser would only need to look one token ahead from (COMMA)? to ')' in order to know which comma it was on.

Also based on the discussion I found here I am further confused: Amend JSON - based grammar to allow for trailing comma

And their source code here: https://github.com/doctrine/annotations/blob/1.13.x/lib/Doctrine/Common/Annotations/DocParser.php#L1307

Is this because of the kind of parser that antlr is trying to generate and not because my grammar isn't LL(1)? Any insight would be appreciated.

options {k=1; backtrack=no;}

tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';

DIGIT  : '0'..'9' ;
LOWER  : 'a'..'z' ;
UPPER  : 'A'..'Z' ;

IDENT  : (LOWER | UPPER | '_') (LOWER | UPPER | '_' | DIGIT)* ;

edit: changed typo in tuple: ... from (IDENT)? to (COMMA)?

rici · Accepted Answer · 2022-08-21T18:51:55.807

Note:

The question has been edited since this answer was written. In the original, the grammar had the line:

tuple : '(' IDENT (COMMA IDENT)* (IDENT)? ')';

and that's what this answer is referring to.

That grammar works without warnings, but it doesn't describe the language you intend to parse. It accepts, for example, (a, b c) but fails to accept (a, b,).

My best guess is that you actually used something like the grammars in the links you provide, in which the final optional element is a comma, not an identifier:

tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';

That does give the warning you indicate, and it won't match (a,) (for example), because, as the warning says, the second alternative has been disabled.

LL(1) as a property of formal grammars only applies to grammars with fixed right-hand sides, as opposed to the "Extended" BNF used by many top-down parser generators, including Antlr, in which a right-hand side can be a set of possibilities. It's possible to expand EBNF using additional non-terminals for each subrule (although there is not necessarily a canonical expansion, and expansions might differ in their parsing category). But, informally, we could extend the concept of LL(k) by saying that in every EBNF right-hand side, at every point where there is more than one alternative, the parser must be able to predict the appropriate alternative looking only at the next k tokens.

You're right that the grammar you provide is LL(1) in that sense. When the parser has just seen IDENT, it has three clear alternatives, each marked by a different lookahead token:

COMMA ↠ predict another repetition of (COMMA IDENT).
IDENT ↠ predict (IDENT).
')' ↠ predict an empty (IDENT)?.

But in the correct grammar (with my modification above), IDENT is a syntax error and COMMA could be either another repetition of ( COMMA IDENT ), or it could be the COMMA in ( COMMA )?.

You could change k=1 to k=2, thereby allowing the parser to examine the next two tokens, and if you did so it would compile with no warnings. In effect, that grammar is LL(2).
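To make the k=1 versus k=2 difference concrete, here is a minimal Python sketch of the comparison at the decision point just after an IDENT. The names `ENTER_LOOP`, `EXIT_LOOP`, `lookahead_set` and `conflicts` are made up for illustration, and the token sequences are written out by hand from the rule rather than derived by any tool, so treat it as an illustration of the idea and not as a description of how Antlr computes its decisions.

```python
# Decision point: the parser has just consumed an IDENT inside
#   tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')' ;
# Each alternative is described by the tokens that would come next.
# The sequences are truncated to two tokens, which is enough for k <= 2.
ENTER_LOOP = [["COMMA", "IDENT"]]      # alternative 1: another (COMMA IDENT)
EXIT_LOOP = [["COMMA", ")"], [")"]]    # alternative 2: leave the loop, then (COMMA)? ')'

def lookahead_set(continuations, k):
    """Set of k-token prefixes that can start the given alternative."""
    return {tuple(seq[:k]) for seq in continuations}

def conflicts(k):
    """Lookahead sequences of length k that fit both alternatives."""
    return lookahead_set(ENTER_LOOP, k) & lookahead_set(EXIT_LOOP, k)

print(conflicts(1))  # {('COMMA',)}  -> one token cannot separate the alternatives
print(conflicts(2))  # set()         -> two tokens can
```

With one token of lookahead both alternatives can start with COMMA, which is what the warning reports. With two tokens, COMMA IDENT and COMMA ')' are distinct, so the overlap disappears.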

You could make an LL(1) grammar by left-factoring the expansion of the EBNF, but it's not going to be as pretty (or as easy for a reader to understand). So if you have a parser generator which can cope with the grammar as written, you might as well not worry about it.

But, for what it's worth, here's a possible solution:

tuple  : '(' idents ')' ;
idents : IDENT ( COMMA ( idents )? )? ;

Untested because I don't have a working Antlr3 installation, but it at least compiles the grammar without warnings. Sorry if there is a problem.
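One way to sanity-check the LL(1) claim without Antlr is to expand both rules into plain BNF and compare the predict sets of each non-terminal's alternatives. The sketch below does that in Python. The helper non-terminals `loop`, `opt`, `tail` and `rest` are my own names for one possible expansion (other expansions exist), and `EPS`, `first_of`, `first_follow` and `ll1_conflicts` are likewise illustrative names, not part of any library.

```python
EPS = "<empty>"  # marker for the empty string

# tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')' ;  expanded to BNF
QUESTION = {
    "tuple": [["(", "IDENT", "loop", "opt", ")"]],
    "loop":  [["COMMA", "IDENT", "loop"], []],
    "opt":   [["COMMA"], []],
}

# tuple : '(' idents ')' ;  idents : IDENT ( COMMA ( idents )? )? ;
FACTORED = {
    "tuple":  [["(", "idents", ")"]],
    "idents": [["IDENT", "tail"]],
    "tail":   [["COMMA", "rest"], []],
    "rest":   [["idents"], []],
}

def first_of(seq, first):
    """FIRST set of a symbol sequence, given FIRST sets of the non-terminals."""
    out = set()
    for sym in seq:
        sym_first = first.get(sym, {sym})  # a terminal's FIRST set is itself
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)  # every symbol in seq can derive the empty string
    return out

def first_follow(grammar, start="tuple"):
    """Compute FIRST and FOLLOW by iterating until nothing changes."""
    first = {nt: set() for nt in grammar}
    follow = {nt: set() for nt in grammar}
    follow[start].add("EOF")
    changed = True
    while changed:
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                f = first_of(alt, first)
                if not f <= first[nt]:
                    first[nt] |= f
                    changed = True
                for i, sym in enumerate(alt):
                    if sym not in grammar:
                        continue  # FOLLOW is only tracked for non-terminals
                    rest = first_of(alt[i + 1:], first)
                    add = rest - {EPS}
                    if EPS in rest:
                        add |= follow[nt]
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    return first, follow

def ll1_conflicts(grammar):
    """List (non-terminal, tokens) pairs where two alternatives share a predict token."""
    first, follow = first_follow(grammar)
    found = []
    for nt, alts in grammar.items():
        seen = set()
        for alt in alts:
            predict = first_of(alt, first)
            if EPS in predict:
                predict = (predict - {EPS}) | follow[nt]
            if predict & seen:
                found.append((nt, predict & seen))
            seen |= predict
    return found

print(ll1_conflicts(QUESTION))   # [('loop', {'COMMA'})]
print(ll1_conflicts(FACTORED))   # []
```

Under this expansion, the question's rule has a conflict on COMMA in the loop, while the factored rule has none, because the only token that can follow `idents` is ')'.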

It would probably be better to use tuple : '(' (idents)? ')'; in order to allow empty tuples. Also, there's no obvious reason to insist on COMMA instead of just using ',', assuming that '(' and ')' work as expected on Antlr3.
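If you end up writing the parser by hand, the factored rule maps directly onto a recursive-descent function that never looks more than one token ahead. The following is a rough Python sketch under my own assumptions: `tokenize`, `parse_tuple`, the `allow_empty` flag and the token names `LPAREN`/`RPAREN` are invented for the example, and the IDENT pattern is only meant to mirror the lexer rule in the question.

```python
import re

# One alternative per token kind; IDENT mirrors the question's lexer rule.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<IDENT>[A-Za-z_][A-Za-z_0-9]*)|(?P<COMMA>,)"
    r"|(?P<LPAREN>\()|(?P<RPAREN>\)))"
)

def tokenize(text):
    """Split text into (kind, value) pairs, ending with an EOF marker."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    tokens.append(("EOF", ""))
    return tokens

def parse_tuple(text, allow_empty=False):
    """Parse  tuple : '(' idents ')'  and return the list of identifiers."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos][0]  # the single token of lookahead

    def expect(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, found {peek()}")
        value = tokens[pos][1]
        pos += 1
        return value

    def idents():
        # idents : IDENT ( COMMA ( idents )? )? ;
        names = [expect("IDENT")]
        if peek() == "COMMA":
            expect("COMMA")
            if peek() == "IDENT":  # decided only after the comma is consumed
                names += idents()
        return names

    expect("LPAREN")
    if allow_empty and peek() == "RPAREN":
        names = []  # tuple : '(' (idents)? ')'
    else:
        names = idents()
    expect("RPAREN")
    expect("EOF")
    return names

print(parse_tuple("(a, b,)"))               # ['a', 'b']
print(parse_tuple("()", allow_empty=True))  # []
```

The point to notice is where the decision is made: the parser consumes the COMMA first and only then checks whether an IDENT follows, so a trailing comma and a separating comma never have to be told apart in advance.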

Thank you for the help! You are right about the typo. I updated the OP. Where my thinking is off is that the decision/state transition happens at `IDENT` and not at `COMMA`. I see how the former makes more sense and why my thinking was giving me an invisible `k+1`, so to speak. Your solution does seem to work and I hadn't considered using options like that. I will have to reflect on the rest of your answer. As an aside: Do you think the code I linked is doing some additional factoring or are they just mistaken on that expression being LL(1)? — ogr, Aug 21 '22 at 17:33
I will also try to understand why your solution works even though it seems to me that it has the same issue, i.e., at IDENT you don't know what the next COMMA would belong to. — ogr, Aug 21 '22 at 17:58
@ogr: There's only one possibility, honest. FOLLOW(idents) is just { ')' }. So if `IDENT` is followed by `COMMA`, then the `COMMA` belongs to `(COMMA (idents)? )`. The only other possible lookahead is ')', which means that `(COMMA (idents)?)?` was not present. — rici, Aug 21 '22 at 18:03
I think I just worked that out on my own after looking at https://www.gatevidyalay.com/left-factoring-examples-compiler-design/ and stepping through it. Thank you again. — ogr, Aug 21 '22 at 18:05
As for the code you linked, [this line](https://github.com/doctrine/annotations/blob/1.13.x/lib/Doctrine/Common/Annotations/DocParser.php#L1307) is a comment. So it's kinda unimportant whether it's LL(1) or not. What matters is the actual code, and it's clear that the test in the code is not where that grammar indicates. (The test is after the COMMA is recognised, not before, which is, in effect, left-factoring.) — rici, Aug 21 '22 at 18:07
Personally, I'm not a big fan of top-down parsing. LALR(1) parsing doesn't make you think so much, and there are lots of LALR(1) parser generators (and even GLR/GLL/Earley generators, so options abound). But, anyway, if you're writing a recursive descent parser by hand, you have lots of possibilities other than rigidly conforming to the grammar. And people tend to write the correct code without thinking about it too much. That's great for writing grammars, and a pain if you're trying to analyse the code to recreate the grammar, for example to write a different tool. — rici, Aug 21 '22 at 18:11
Also, for future reference, editing a question after an answer is provided is discouraged here, particularly if it makes the answer non-sensical. The idea of SO is to create a permanent repository of questions and answers, and it is important that accepted answers address the question in the post (which is all that future readers will see). So editing the question means that people who have already answered need to edit their answers, which is an unreasonable requirement (particularly since answerers are not notified when questions are modified). — rici, Aug 21 '22 at 18:19
Thanks, I will keep that in mind. I was able to resolve the other ambiguities in my code thanks to your guidance. I am writing my compiler by hand as well. You should update your note to say ```tuple : '(' IDENT (COMMA IDENT)* (IDENT)? ')';``` — ogr, Aug 21 '22 at 18:47
