1

Let's consider this simple ANTL4 language grammar.

Lexer:

lexer grammar BiaLexer;

Lt                      : '<' ;
Gt                      : '>' ;
Identifier              : [a-zA-Z] ([a-zA-Z1-9] | ':')* ;
LeftParen               : '(' ;
RightParen              : ')' ;
Comma                   : ',' ;

Whitespace              : (' ' | '\n') -> skip ;

Parser:

parser grammar BiaParser;

options { tokenVocab = BiaLexer; }

typeExpression
    : referredName=Identifier # typeReference ;

expression
    : callee=expression callTypeVariableList LeftParen callArgumentList RightParen # callExpression
    | left=expression operator=Lt right=expression # binaryOperation
    | left=expression operator=Gt right=expression # binaryOperation
    | referredName=Identifier # reference
    | LeftParen expression RightParen # parenExpression ;

callTypeVariableList: Lt typeExpression (Comma typeExpression)* Gt ;

callArgumentList: (expression (Comma expression)*)? ;

So, basically, this language has only:

  • ordinary references, e.g. a

  • type references, e.g. A

  • comparisons, e.g. a < b or c > d

  • expressions wrapped in parenthesis, e.g. (a)

  • and, finally, generic function calls: e.g. f<A, B>(a, b) or f<A>(a) (similar to, let's say, Kotlin)

This grammar is ambiguous. A simple expression like f<A>(a) can be interpreted as...

...a generic call: Call(calle = ref:f, typeArgs = TypeArgs(typeRef:A), args = Args(ref:a))

...or a chain of comparisons between a reference, another reference and an parenthesised expression: Binary(op = >, left = Binary(op = <, left = ref:f, right = ref:A), right = Paren(ref:a))

The actual parser generated by ANTLR does the second, i.e. comparison chain. If I comment-out the binary operation rules...

//    | left=expression operator=Lt right=expression # binaryOperation
//    | left=expression operator=Gt right=expression # binaryOperation

...then the result is, as expected by me, the generic call.

Please note that I've, on purpose, put the #callExpression case on the top of the expression rule, with an intention of declaring that it has higher precedence than the comparison cases below. I believed that that's how one declares case precedence in ANTLR, but obviously it doesn't work in this case.

Questions:

  • why does ANTLR interpret f<A>(a) as a chain of comparisons?
  • how can I fix that, i.e. make the generic call have precedence over comparison chain?

If that matters, I can provide the code I've used to dump the AST to a pretty-string, but that's just a simple ANTLR visitor emitting a string. I've skipped it for readability.

cubuspl42
  • 7,833
  • 4
  • 41
  • 65
  • @MikeCargal As I've noted in the question, I'd just like the generic call to have precedence over the chain-of-comparisons interpretation. It was my understanding that that's how it works in other languages; i.e. that in case of ambiguity, some rules can take precedence over the other. – cubuspl42 Feb 03 '22 at 17:48
  • Indded, my workaround was exactly to limit the possible rules for `callee`, but I cannot see why (conceptually) couldn't it be any expression. From the practical perspective, `f < A > (a)` has close to zero chance of being a well-typed expression. But if for some reason someone really wanted to express exactly kind of expression tree, they still could use additional parentheses. – cubuspl42 Feb 03 '22 at 17:52
  • 1
    For whatever reason, the predictive parse never selects the 1st alt in `expression`. I've seen things like this before when the alts you want with higher priority aren't selected because it's not unfolded. And, it could be a bug, or a design decision so that it doesn't backtrack so much. Try replacing with this: `expression : expression Lt ( typeExpression (Comma typeExpression)* Gt LeftParen callArgumentList RightParen | expression) | expression Gt expression | Identifier | LeftParen expression RightParen ;` which is a refactoring the unfolds one rule, and a second refactoring that regroups. – kaby76 Feb 03 '22 at 19:40

2 Answers2

2

I took a look at the ANTLR grammars for Swift and Rust. Both of them allowed only for some sort of identifier to precede the generic type specification (i.e. they did not allow for any expression to be used as a callee).

Using that approach, something like this parses your input just fine:

grammar Bia
    ;

typeExpression: referredName = Identifier # typeReference;

expression
    : callee=Identifier callTypeVariableList LeftParen callArgumentList RightParen # callExpr
    | left = expression operator = (Lt | Gt) right = expression             # binaryExpression
    | Identifier                                                            # reference
    | LeftParen expression RightParen                                       # parenExpression
    ;

callTypeVariableList
    : Lt typeExpression (Comma typeExpression)* Gt
    ;

callArgumentList: (expression (Comma expression)*)?;

Lt:         '<';
Gt:         '>';
Identifier: [a-zA-Z] ([a-zA-Z1-9] | ':')*;
LeftParen:  '(';
RightParen: ')';
Comma:      ',';

Whitespace: (' ' | '\n') -> skip;

You might find that you want a rule that is a bit more flexible about the sort of callee identifiers you want to allow, without it being just ANY sort of expression (There's probably a good argument that the boolean result of a < or > expression couldn't really serve as a callee anyway).

The following allows for much more flexibility and still correctly matches your expression:

grammar Bia
    ;

typeExpression: referredName = Identifier # typeReference;

expression
    : callExpression                                            # callExpr
    | left = expression operator = (Lt | Gt) right = expression # binaryExpression
    | Identifier                                                # reference
    | LeftParen expression RightParen                           # parenExpression
    ;

callExpression
    : callee = calleeIdentifier (callTypeVariableList)? LeftParen callArgumentList RightParen # idCall
    | callee = callExpression (callTypeVariableList)? LeftParen callArgumentList RightParen # exprCall
    ;

callTypeVariableList
    : Lt typeExpression (Comma typeExpression)* Gt
    ;

calleeIdentifier: Identifier ('.' Identifier)*;

callArgumentList: (expression (Comma expression)*)?;

Lt:         '<';
Gt:         '>';
Identifier: [a-zA-Z] ([a-zA-Z1-9] | ':')*;
LeftParen:  '(';
RightParen: ')';
Comma:      ',';

Whitespace: (' ' | '\n') -> skip;

NOTE: I also tried kaby76's suggestion, and it does handle your situation. You might find the resulting context class a bit awkward though (as there will be a single rule alternative that matches either a call of an LT expression).

cubuspl42
  • 7,833
  • 4
  • 41
  • 65
Mike Cargal
  • 6,610
  • 3
  • 21
  • 27
0

To answer the first (my own) subquestion, I still don't know why doesn't the first #callExpression alternative "get picked" by ANTLR in the original grammar. This comment by kaby76 makes an educated guess that sounds reasonable.

Mike Cargal's answer solves the problem as described in the question very well. Building on top of it, I've adjusted the grammar so it also handles function call on parenthesised expressions (like (a)<A>(b) or (a + b)<A>(c)).

The slight difference in the approach is that, in my case, I have a separate rule for a "callable expression" (an expression that can be called), not for the call expression itself. Still, as you can "call the call" in this adjusted grammar, the call expression is an alternative in this rule.

The modified parser grammar looks like this:

parser grammar BiaParser;

options { tokenVocab = BiaLexer; }

typeExpression
    : referredName=Identifier # typeReference ;

expression
    : callableExpression # callableExpression_
    | left=expression operator=Lt right=expression # binaryOperation
    | left=expression operator=Gt right=expression # binaryOperation ;

callableExpression
    : LeftParen expression RightParen # parenExpression
    | callee=callableExpression callTypeVariableList? LeftParen callArgumentList RightParen # callExpression
    | referredName=Identifier # reference ;

callTypeVariableList: Lt typeExpression (Comma typeExpression)* Gt ;

callArgumentList: (expression (Comma expression)*)? ;

It allows the following kind of generic calls:

  • call on identifier, e.g. f<A>(a)
  • call on any parenthesised expression
  • call on call, e.g. f<A>(a, b)<B, C>(c, d, e)

I've verified with tests that all the above cases are parsed as expected.

One interesting thing to note is that, as far as I can see, this adjusted grammar doesn't really limit the programmer in any way, compared to the original grammar. It's difficult to reason about what it would mean to call a < / > expression directly, as even the original grammar (by intention and ordering) considered a < b<B>(c) to be a lt-comparison of reference a and b<B>(c) generic call (and that's what most programmers would, probably, expect). This (probably) generalises to all kind of binary operator (e.g. +, -) and possibly more kind of expressions appearing in general-purpose languages.

cubuspl42
  • 7,833
  • 4
  • 41
  • 65