I'm trying to understand which alternative an ANTLR rule prefers when several match. According to this answer, alternatives in lexer rules are unordered, except when they follow a non-greedy subrule (*?, +?, ??). For example, this grammar:

lexer grammar Test;

X : 'z'*? (FOO | FOOBAR);
fragment FOO: 'foo';
BAR: 'bar';
fragment FOOBAR: 'foobar';

given input "foobar" matches two tokens: X "foo" and BAR "bar", because alternatives in X are ordered. If we remove 'z'*? or even change it to a greedy 'z'*, alternatives become unordered again and the only matched token is X "foobar".

However, if I change the rules to parser rules:

grammar Test;

x : 'z'*? (foo | foobar);
foo: 'foo';
bar: 'bar';
foobar: 'foobar';

greediness on 'z' doesn't seem to matter at all. Given input "foobar", rule x follows the second alternative and consumes the whole input, producing the tree (x (foobar "foobar")).

The question is: is there definitive documentation on how lexer and parser rules consume input, and which match they prefer when several are possible?

Community

1 Answer

is there definitive documentation on how lexer and parser rules consume input, and which match they prefer when several are possible?

Definitive documentation (apart from reading the source code):

1) Comments by Sam Harwell (the author) on Stack Overflow

2) Terence Parr's book on ANTLR 4, The Definitive ANTLR 4 Reference

For your case, the relevant explanation is in Terence Parr's book:

Section 15.6, "Wildcard Operator and Nongreedy Subrules", under "Nongreedy Lexer Subrules":

After crossing through a nongreedy subrule within a lexical rule, all decision making from then on is “first match wins.” For example, alternative 'ab' in the rule right-side .*? ('a'|'ab') is dead code and can never be matched. If the input is ab, the first alternative, 'a', matches the first character and therefore succeeds. ('a'|'ab') by itself on the right side of a rule properly matches the second alternative for input ab. This quirk arises from a nongreedy design decision that’s too complicated to go into here.

So for a complete grammar like this:

grammar TestGrammar;
test:XXX  EOF;
WS: [ \t\f]+ -> channel(1);
CRLF: '\r'? '\n' -> channel(1);
XXX : 'z'*? (FOO | FOOBAR) {System.out.println(getText());};

fragment FOO: 'foo';
fragment BAR: 'bar';
fragment FOOBAR: 'foobar';

An input such as zfoo is tokenized by the XXX rule, and the output of the lexer action confirms this. For the input zfoobar, only the first four characters, zfoo, are tokenized by XXX; the remaining bar is left as unrecognized input because of the "first match wins" rule quoted above.
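The two selection policies can be imitated with a small toy model. This is only a sketch of the observable outcome, not ANTLR's actual algorithm (ANTLR's lexer simulates an ATN), and the function names match_longest, match_first and match_x are hypothetical helpers invented for this illustration.

```python
def match_longest(text, alternatives):
    """Pick the alternative that consumes the most input (ordinary lexer behaviour)."""
    candidates = [alt for alt in alternatives if text.startswith(alt)]
    return max(candidates, key=len) if candidates else None


def match_first(text, alternatives):
    """Pick the first listed alternative that matches ("first match wins")."""
    for alt in alternatives:
        if text.startswith(alt):
            return alt
    return None


def match_x(text, nongreedy_prefix):
    """Toy model of  X : 'z'* (FOO | FOOBAR)  versus  X : 'z'*? (FOO | FOOBAR)."""
    alternatives = ["foo", "foobar"]  # same order as in the grammar
    # Neither alternative starts with 'z', so the greedy and the non-greedy
    # loop consume the same run of leading 'z' characters in this toy model.
    rest = text.lstrip("z")
    prefix = text[: len(text) - len(rest)]
    # Assumption of the model: crossing a non-greedy subrule switches the
    # remaining decisions from longest match to first match.
    chooser = match_first if nongreedy_prefix else match_longest
    matched = chooser(rest, alternatives)
    return None if matched is None else prefix + matched


print(match_x("zfoobar", nongreedy_prefix=True))   # zfoo
print(match_x("zfoobar", nongreedy_prefix=False))  # zfoobar
```

With the non-greedy prefix the model stops at zfoo and leaves bar behind, which mirrors the behaviour described for XXX above; with the greedy prefix it consumes the whole input.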

And for non-greedy parser subrules:

Nongreedy Parser Subrules

Nongreedy subrules and wildcards are also useful within parsers to do “fuzzy parsing” where the goal is to extract information from an input file without having to specify the full grammar. In contrast to nongreedy lexer decision making, parsers always make globally correct decisions. A parser never makes a decision that will ultimately cause valid input to fail later during the parse. Here is the central idea: nongreedy parser subrules match the shortest sequence of tokens that preserves a successful parse for a valid input sentence.

In other words, a non-greedy subrule in a parser does not impose any ordering on the alternatives that follow it.
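There is also a lexing step hidden in the parser grammar from the question: in a combined grammar, each string literal used in a parser rule ('z', 'foo', 'bar', 'foobar') becomes an implicit token, and a lexer without non-greedy subrules prefers the longest match. The sketch below imitates that step with a toy maximal-munch tokenizer; tokenize and LITERALS are hypothetical names, and this is not how ANTLR is implemented internally.

```python
def tokenize(text, literals):
    """Split text into tokens, always taking the longest literal that matches."""
    tokens = []
    pos = 0
    while pos < len(text):
        candidates = [lit for lit in literals if text.startswith(lit, pos)]
        if not candidates:
            raise ValueError(f"no token matches at position {pos}")
        longest = max(candidates, key=len)
        tokens.append(longest)
        pos += len(longest)
    return tokens


# The implicit tokens of the parser grammar in the question.
LITERALS = ["z", "foo", "bar", "foobar"]

print(tokenize("foobar", LITERALS))   # ['foobar']
print(tokenize("zfoobar", LITERALS))  # ['z', 'foobar']
```

Under this model the parser never sees a 'foo' token followed by a 'bar' token for the input foobar, so only the foobar alternative of x can match, whether the 'z' loop is greedy or not.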

JavaMan