Disambiguating a left-recursive ANTLR4 rule

Question

Let's consider this simple ANTLR4 language grammar.

Lexer:

lexer grammar BiaLexer;

Lt                      : '<' ;
Gt                      : '>' ;
Identifier              : [a-zA-Z] ([a-zA-Z1-9] | ':')* ;
LeftParen               : '(' ;
RightParen              : ')' ;
Comma                   : ',' ;

Whitespace              : (' ' | '\n') -> skip ;

Parser:

parser grammar BiaParser;

options { tokenVocab = BiaLexer; }

typeExpression
    : referredName=Identifier # typeReference ;

expression
    : callee=expression callTypeVariableList LeftParen callArgumentList RightParen # callExpression
    | left=expression operator=Lt right=expression # binaryOperation
    | left=expression operator=Gt right=expression # binaryOperation
    | referredName=Identifier # reference
    | LeftParen expression RightParen # parenExpression ;

callTypeVariableList: Lt typeExpression (Comma typeExpression)* Gt ;

callArgumentList: (expression (Comma expression)*)? ;

So, basically, this language has only:

ordinary references, e.g. a
type references, e.g. A
comparisons, e.g. a < b or c > d
expressions wrapped in parentheses, e.g. (a)
and, finally, generic function calls: e.g. f<A, B>(a, b) or f<A>(a) (similar to, let's say, Kotlin)

This grammar is ambiguous. A simple expression like f<A>(a) can be interpreted as...

...a generic call: Call(callee = ref:f, typeArgs = TypeArgs(typeRef:A), args = Args(ref:a))

...or a chain of comparisons between a reference, another reference and a parenthesised expression: Binary(op = >, left = Binary(op = <, left = ref:f, right = ref:A), right = Paren(ref:a))
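To make the ambiguity concrete, here is a minimal Python sketch (not part of the original question). The node classes `Ref`, `Paren`, `Binary` and `Call` and the helper `tokens` are hypothetical names chosen for illustration; they only mirror the labels used in the grammar above. The sketch builds both candidate trees and flattens each back into the token sequence it derives.

```python
from dataclasses import dataclass

# Hypothetical AST node types, named after the alternatives in the grammar.
@dataclass
class Ref:
    name: str

@dataclass
class Paren:
    inner: object

@dataclass
class Binary:
    op: str
    left: object
    right: object

@dataclass
class Call:
    callee: object
    type_args: list  # type names as plain strings
    args: list       # argument expressions

def tokens(node):
    """Flatten a tree back into the token sequence it derives."""
    if isinstance(node, Ref):
        return [node.name]
    if isinstance(node, Paren):
        return ["("] + tokens(node.inner) + [")"]
    if isinstance(node, Binary):
        return tokens(node.left) + [node.op] + tokens(node.right)
    if isinstance(node, Call):
        out = tokens(node.callee) + ["<"]
        for i, type_name in enumerate(node.type_args):
            out += ([","] if i else []) + [type_name]
        out += [">", "("]
        for i, arg in enumerate(node.args):
            out += ([","] if i else []) + tokens(arg)
        return out + [")"]
    raise TypeError(f"unknown node: {node!r}")

# The two interpretations of f<A>(a) described above.
generic_call = Call(Ref("f"), ["A"], [Ref("a")])
comparison_chain = Binary(">", Binary("<", Ref("f"), Ref("A")), Paren(Ref("a")))

print(tokens(generic_call))
print(tokens(comparison_chain))
```

Both trees flatten to the same seven tokens, so the token stream alone cannot tell the parser which tree is intended.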

The parser actually generated by ANTLR produces the second interpretation, i.e. the comparison chain. If I comment out the binary operation rules...

//    | left=expression operator=Lt right=expression # binaryOperation
//    | left=expression operator=Gt right=expression # binaryOperation

...then the result is, as I expected, the generic call.

Please note that I've deliberately put the #callExpression case at the top of the expression rule, intending to declare that it has higher precedence than the comparison cases below. I believed that this is how one declares case precedence in ANTLR, but obviously it doesn't work in this case.

Questions:

why does ANTLR interpret f<A>(a) as a chain of comparisons?
how can I fix that, i.e. make the generic call take precedence over the comparison chain?

If that matters, I can provide the code I've used to dump the AST to a pretty-string, but that's just a simple ANTLR visitor emitting a string. I've skipped it for readability.

Can a `callee` REALLY be *ANY* `expression`? How would I know (not as the parser) that `f<A>(a)` is a function call rather than "f is less than A is greater than (a)"? Being able to express how you would do this yourself will likely provide insight into how to disambiguate the grammar. — Mike Cargal, Feb 03 '22 at 17:27
@MikeCargal As I've noted in the question, I'd just like the generic call to have precedence over the chain-of-comparisons interpretation. It was my understanding that that's how it works in other languages; i.e. that in case of ambiguity, some rules can take precedence over the other. — cubuspl42, Feb 03 '22 at 17:48
Indeed, my workaround was exactly to limit the possible rules for `callee`, but I cannot see why (conceptually) it couldn't be any expression. From the practical perspective, `f < A > (a)` has close to zero chance of being a well-typed expression. But if for some reason someone really wanted to express exactly that kind of expression tree, they still could use additional parentheses. — cubuspl42, Feb 03 '22 at 17:52
For whatever reason, the predictive parse never selects the 1st alt in `expression`. I've seen things like this before, when the alts you want to have higher priority aren't selected because the rule is not unfolded. It could be a bug, or a design decision so that it doesn't backtrack so much. Try replacing it with this: `expression : expression Lt ( typeExpression (Comma typeExpression)* Gt LeftParen callArgumentList RightParen | expression) | expression Gt expression | Identifier | LeftParen expression RightParen ;` which is a refactoring that unfolds one rule, followed by a second refactoring that regroups. — kaby76, Feb 03 '22 at 19:40

score 2 · Answer 1 · edited Feb 04 '22 at 09:04

I took a look at the ANTLR grammars for Swift and Rust. Both of them allowed only for some sort of identifier to precede the generic type specification (i.e. they did not allow for any expression to be used as a callee).

Using that approach, something like this parses your input just fine:

grammar Bia
    ;

typeExpression: referredName = Identifier # typeReference;

expression
    : callee=Identifier callTypeVariableList LeftParen callArgumentList RightParen # callExpr
    | left = expression operator = (Lt | Gt) right = expression             # binaryExpression
    | Identifier                                                            # reference
    | LeftParen expression RightParen                                       # parenExpression
    ;

callTypeVariableList
    : Lt typeExpression (Comma typeExpression)* Gt
    ;

callArgumentList: (expression (Comma expression)*)?;

Lt:         '<';
Gt:         '>';
Identifier: [a-zA-Z] ([a-zA-Z1-9] | ':')*;
LeftParen:  '(';
RightParen: ')';
Comma:      ',';

Whitespace: (' ' | '\n') -> skip;

You might find that you want a rule that is a bit more flexible about the sort of callee identifiers you want to allow, without it being just ANY sort of expression (There's probably a good argument that the boolean result of a < or > expression couldn't really serve as a callee anyway).

The following allows for much more flexibility and still correctly matches your expression:

grammar Bia
    ;

typeExpression: referredName = Identifier # typeReference;

expression
    : callExpression                                            # callExpr
    | left = expression operator = (Lt | Gt) right = expression # binaryExpression
    | Identifier                                                # reference
    | LeftParen expression RightParen                           # parenExpression
    ;

callExpression
    : callee = calleeIdentifier (callTypeVariableList)? LeftParen callArgumentList RightParen # idCall
    | callee = callExpression (callTypeVariableList)? LeftParen callArgumentList RightParen # exprCall
    ;

callTypeVariableList
    : Lt typeExpression (Comma typeExpression)* Gt
    ;

calleeIdentifier: Identifier ('.' Identifier)*;

callArgumentList: (expression (Comma expression)*)?;

Lt:         '<';
Gt:         '>';
Identifier: [a-zA-Z] ([a-zA-Z1-9] | ':')*;
LeftParen:  '(';
RightParen: ')';
Comma:      ',';

Whitespace: (' ' | '\n') -> skip;

NOTE: I also tried kaby76's suggestion, and it does handle your situation. You might find the resulting context class a bit awkward though (as there will be a single rule alternative that matches either a call or an LT expression).

Thank you for your answer! As I'm thinking of it, it makes sense that not all expressions are callable, as some expressions just _can't_ be called in the conventional operator precedence. It would be expected for `a + b.f(c)` to be parsed as addition of `a` and the result of `b.f (c)` call. It makes it impossible to call the `+` expression directly. — cubuspl42, Feb 04 '22 at 08:58
On the other hand, it would also be expected for `(a + b).f(c)` to be a call on the result of the addition. Sadly, it can't be parsed in the proposed grammar. Do you think it would be easy to support that particular case? I think that this single "feature" would bring this grammar more-or-less on par with popular general-purpose languages (in the topic of generic function calls). — cubuspl42, Feb 04 '22 at 09:01
I think I've got it; I included my adjustments in a [separate answer](https://stackoverflow.com/a/70984749/1483676). — cubuspl42, Feb 04 '22 at 10:23
Right, I wasn’t trying to build out an exhaustive example of a callable expression, but to point in the direction of how you could flesh it out (as you have). The key, of course, is that only certain types of expressions are callable. Glad you got it working. — Mike Cargal, Feb 04 '22 at 12:45

cubuspl42 · Answer 2 · 2022-02-04T13:14:08.753

To answer the first (my own) subquestion: I still don't know why the first #callExpression alternative doesn't "get picked" by ANTLR in the original grammar. The comment by kaby76 under the question makes an educated guess that sounds reasonable.

Mike Cargal's answer solves the problem as described in the question very well. Building on top of it, I've adjusted the grammar so that it also handles function calls on parenthesised expressions (like (a)<A>(b) or (a + b)<A>(c)).

The slight difference in the approach is that, in my case, I have a separate rule for a "callable expression" (an expression that can be called), not for the call expression itself. Still, as you can "call the call" in this adjusted grammar, the call expression is an alternative in this rule.

The modified parser grammar looks like this:

parser grammar BiaParser;

options { tokenVocab = BiaLexer; }

typeExpression
    : referredName=Identifier # typeReference ;

expression
    : callableExpression # callableExpression_
    | left=expression operator=Lt right=expression # binaryOperation
    | left=expression operator=Gt right=expression # binaryOperation ;

callableExpression
    : LeftParen expression RightParen # parenExpression
    | callee=callableExpression callTypeVariableList? LeftParen callArgumentList RightParen # callExpression
    | referredName=Identifier # reference ;

callTypeVariableList: Lt typeExpression (Comma typeExpression)* Gt ;

callArgumentList: (expression (Comma expression)*)? ;

It allows the following kinds of generic calls:

call on identifier, e.g. f<A>(a)
call on any parenthesised expression
call on call, e.g. f<A>(a, b)<B, C>(c, d, e)

I've verified with tests that all the above cases are parsed as expected.
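As a rough illustration of the shape of the adjusted grammar, here is a small hand-written recursive-descent parser in Python. It is only a sketch under stated assumptions: it is not generated by ANTLR and does not reproduce ANTLR's prediction algorithm. The names `tokenize`, `Parser` and `parse` and the tuple-based tree format are hypothetical. A callable expression is an identifier or a parenthesised expression followed by any number of call suffixes, and the type-argument list is parsed speculatively, rewinding if no `(` follows.

```python
import re

# Identifier pattern mirrors the lexer rule in the question ([a-zA-Z1-9] and ':').
TOKEN = re.compile(r"\s*([A-Za-z][A-Za-z1-9:]*|[<>(),])")

def tokenize(text):
    text, pos, out = text.strip(), 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        out.append(m.group(1))
        pos = m.end()
    return out

class Parser:
    def __init__(self, text):
        self.toks, self.pos = tokenize(text), 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def identifier(self):
        tok = self.peek()
        if tok is None or not tok[0].isalpha():
            raise SyntaxError(f"expected identifier, got {tok!r}")
        return self.eat()

    def expression(self):
        # Left-associative chain of comparisons over callable expressions.
        left = self.callable_expression()
        while self.peek() in ("<", ">"):
            op = self.eat()
            left = ("binary", op, left, self.callable_expression())
        return left

    def callable_expression(self):
        if self.peek() == "(":
            self.eat("(")
            node = ("paren", self.expression())
            self.eat(")")
        else:
            node = ("ref", self.identifier())
        while True:  # zero or more call suffixes: typeArgs? '(' args ')'
            start = self.pos
            type_args = self.try_type_args()
            if self.peek() != "(":
                self.pos = start  # not a call after all: rewind
                return node
            self.eat("(")
            args = []
            if self.peek() != ")":
                args.append(self.expression())
                while self.peek() == ",":
                    self.eat(",")
                    args.append(self.expression())
            self.eat(")")
            node = ("call", node, type_args, args)

    def try_type_args(self):
        # Speculatively parse '<' Identifier (',' Identifier)* '>'.
        start = self.pos
        try:
            self.eat("<")
            names = [self.identifier()]
            while self.peek() == ",":
                self.eat(",")
                names.append(self.identifier())
            self.eat(">")
            return names
        except SyntaxError:
            self.pos = start
            return []

def parse(text):
    parser = Parser(text)
    tree = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree

print(parse("f<A>(a)"))
print(parse("a < b"))
```

In this sketch, `f<A>(a)`, `(a)<A>(b)` and `f<A>(a, b)<B, C>(c, d, e)` all come out with a `call` node at the root, while `a < b` remains a comparison, because the speculative type-argument parse is abandoned when no `(` follows the closing `>`.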

One interesting thing to note is that, as far as I can see, this adjusted grammar doesn't really limit the programmer in any way, compared to the original grammar. It's difficult to reason about what it would mean to call a < / > expression directly, as even the original grammar (by intention and ordering) considered a < b<B>(c) to be a lt-comparison of the reference a and the generic call b<B>(c) (and that's what most programmers would, probably, expect). This (probably) generalises to all kinds of binary operators (e.g. +, -) and possibly more kinds of expressions appearing in general-purpose languages.
