ANTLR4 error recovery issues for class bodies

Question

I've found a strange issue regarding error recovery in ANTLR4. If I take the grammar example from the ANTLR book

grammar simple;

prog:   classDef+ ; // match one or more class definitions

classDef
    :   'class' ID '{' member+ '}' // a class has one or more members
    ;

member
    :   'int' ID ';'                       // field definition
    |   'int' f=ID '(' ID ')' '{' stat '}' // method definition
    ;

stat:   expr ';'
    |   ID '=' expr ';'
    ;

expr:   INT 
    |   ID '(' INT ')'
    ;

INT :   [0-9]+ ;
ID  :   [a-zA-Z]+ ;
WS  :   [ \t\r\n]+ -> skip ;

and use the input

class T {
    y;
    int x;
}

it will see the first member as an error (as it expects 'int' before 'y').

classDef
 | "class"
 | ID "T"
 | "{"
 |- member
 |   | ID "y" -> error
 |   | ";" -> error
 |- member
 |   | "int"
 |   | ID "x"
 |   | ";"

In this case ANTLR4 recovers from the error in the first member subrule and parses the second member correctly.

But if the member subrule in classDef is changed from the mandatory member+ to the optional member*

classDef
    :   'class' ID '{' member* '}' // a class has zero or more members
    ;

then the parsed tree will look like

classDef
 | "class" -> error
 | ID "T" -> error
 | "{" -> error
 | ID "y" -> error
 | ";" -> error
 | "int" -> error
 | ID "x" -> error
 | ";" -> error
 | "}" -> error

It seems that the error recovery cannot solve the issue inside the member subrule anymore.

Obviously using member+ is the way forward as it provides the correct error recovery result. But how do I allow empty class bodies? Am I missing something in the grammar?

The DefaultErrorStrategy class is quite complex, with its token deletions and insertions, and the book explains the theory behind it very well. What I'm missing is how to implement custom error recovery for specific rules.

In my case I would add something like "if { is already consumed, try to find int or }" to optimize the error recovery for this rule.

Is this possible with ANTLR4 error recovery in a reasonable way at all? Or do I have to write a parser by hand to really gain control over error recovery for those use cases?

I can't reproduce the first result tree (with "member+"). It's the same as with "member*"--flat, for Java or C#, Antlr4.9.3, and TestRig. What version of Antlr4 are you using? What is your parser driver code? — kaby76, Apr 01 '22 at 11:31
I did the tests with the ANTLR plugin in Intellij IDEA 2021.3.2 (Ultimate Edition). The first case with `member+` is actually described in the book "The Definitive ANTLR4 Reference" on pg. 167 - so I haven't cross checked yet with my actual code. — sazz, Apr 02 '22 at 11:48
I was able to reproduce it only using Intellij 2021.3, Antlr plugin 1.17. (I don't use Intellij because the file chooser is not lazy, requires 5 minutes to respond.) It uses Antlr 4.9.1. But, the "flat" tree occurs for 4.9.1 (and v4.9.3) with "grun", and with simple default drivers in Java and CSharp. I don't know why there is a difference, but likely the Antlr Intellij plugin code does something different than what people get with the default behavior, which always results in a flat tree, no "member" nodes. This requires additional analysis. A github.com/antlr/antlr4 issue should be created. — kaby76, Apr 03 '22 at 12:42
After reading the source and various Issues in github.com/antlr/, it turns out that the trees displayed in Intellij are not what is computed by a standard, generated parser via the Antlr4 tool .jar. This is because the Plugin uses an API to the parser generator in the tool, e.g., [here](https://github.com/antlr/antlr4/blob/97c793e446ba70e4e63f84e6c2bffd5fffd961a5/tool/src/org/antlr/v4/tool/GrammarParserInterpreter.java#L43) that "interprets" the grammar and does not execute actions. Will try to port the alternative tree construction code over to a standard parser. — kaby76, Apr 03 '22 at 16:09
For the given input, both the `member*` and `member+` options never enter the `member()` rule. The input causes the sync in classDef() to fail before it gets a chance to try for `member` sub-rule matches. In either case the parser syncs by consuming tokens until it gets to a 'class' token. — Chris, Apr 18 '22 at 19:41

Chris · Answer 1 · 2022-04-18T20:16:44.617

It is worth noting that the parser never enters the sub rule for the given input. The classDef rule fails before trying to match a member.

Before trying to parse the sub-rule, the sync method on DefaultErrorStrategy is called. This sync recognizes there is a problem and tries to recover by deleting a single token to see if that fixes things up.

In this case it doesn't, so an exception is thrown and then tokens are consumed until a 'class' token is found. This makes sense because that is what can follow a classDef and it is the classDef rule, not the member rule that is failing at this point.
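The "consume tokens until something in a resync set shows up" step can be pictured with a small toy sketch. This is not ANTLR code: the class ResyncSketch and the helper skipUntil are hypothetical names made up for illustration, tokens are plain strings, and the resync sets are simplified assumptions (the real follow set also contains EOF). It only contrasts resyncing on 'class' with the "find int or }" idea from the question.

```java
import java.util.List;
import java.util.Set;

// Toy illustration only; none of these names exist in the ANTLR runtime.
class ResyncSketch {
    // Hypothetical helper: returns the index of the first token at or after
    // `pos` that is in `resyncSet`, or tokens.size() if the input runs out.
    public static int skipUntil(List<String> tokens, int pos, Set<String> resyncSet) {
        while (pos < tokens.size() && !resyncSet.contains(tokens.get(pos))) {
            pos++; // discard the offending token
        }
        return pos;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("class", "T", "{", "y", ";", "int", "x", ";", "}");
        int afterBrace = 3; // index of "y", the first token after '{'

        // Rule-level recovery: resume only at a token that can follow classDef.
        int ruleLevel = skipUntil(tokens, afterBrace, Set.of("class"));
        // Member-level recovery: resume at the next member or the closing brace.
        int memberLevel = skipUntil(tokens, afterBrace, Set.of("int", "}"));

        System.out.println(ruleLevel + " " + memberLevel); // prints "9 5"
    }
}
```

With 'class' as the only resync token, the rest of the input is discarded (index 9 is the end of the token list). With 'int' and '}' in the set, only "y" and ";" are skipped and parsing could resume at the second member.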

It doesn't look simple to do correctly, but if you install a custom subclass of DefaultErrorStrategy and override the sync() method, you can get any recovery strategy you like.

Something like the following could be a starting point:

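import org.antlr.v4.runtime.DefaultErrorStrategy;
import org.antlr.v4.runtime.Parser;
import org.antlr.v4.runtime.RecognitionException;

// The class name is illustrative; install it with parser.setErrorHandler(...).
public class ClassBodyErrorStrategy extends DefaultErrorStrategy {
  @Override
  public void sync(Parser recognizer) throws RecognitionException {
    // Skip the sync while inside classDef so the member rule still gets tried.
    if (recognizer.getContext() instanceof simpleParser.ClassDefContext) {
      return;
    }
    super.sync(recognizer);
  }
}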

The result is that the sync doesn't fail, and the member rule is executed. Parsing the first member fails, and the default recovery method handles moving on to the next member in the class.
