Is there definitive documentation on how lexer and parser rules
consume input and which matches they prefer when several are possible?
Definitive documentation (apart from reading the source code):
1) Comments on Stack Overflow by Sam Harwell (co-author of ANTLR 4)
2) Terence Parr's book on ANTLR 4, The Definitive ANTLR 4 Reference
For your case, the relevant explanation is in Terence Parr's book, Section 15.6, "Wildcard Operator and Nongreedy Subrules", under "Nongreedy Lexer Subrules":
After crossing through a nongreedy subrule within a lexical
rule, all decision making from then on is “first match wins.” For
example, alternative 'ab' in the rule right-side .*? ('a'|'ab') is
dead code and can never be matched. If the input is ab, the first
alternative, 'a', matches the first character and therefore succeeds.
('a'|'ab') by itself on the right side of a rule properly matches the
second alternative for input ab. This quirk arises from a nongreedy
design decision that’s too complicated to go into here.
So for a complete grammar like this:
grammar TestGrammar;
test:XXX EOF;
WS: [ \t\f]+ -> channel(1);
CRLF: '\r'? '\n' -> channel(1);
XXX : 'z'*? (FOO | FOOBAR) {System.out.println(getText());};
fragment FOO: 'foo';
fragment BAR: 'bar';
fragment FOOBAR: 'foobar';
For an input like zfoo, the whole input is tokenized by the XXX rule, and the output of the lexer action confirms this. For the input zfoobar, the first 4 characters, zfoo, are still tokenized by the XXX rule, leaving bar unrecognized, because of the "first match wins" rule quoted above.
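As a rough analogy only (this is not ANTLR and does not use the ANTLR runtime), the same "first match wins" effect can be reproduced with java.util.regex, whose alternation is always tried in order. The class name FirstMatchWinsDemo and the helper matchedPrefix are made up for this sketch, and the second pattern is a hypothetical reordering that is not part of the grammar above. Note that a plain ANTLR lexer rule without a nongreedy subrule would prefer the longest match, which Java regex alternation does not do, so the analogy only covers the behavior after the nongreedy subrule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: Java regex alternation is ordered, which mimics the
// "first match wins" behavior the book describes for ANTLR lexer rules
// after a nongreedy subrule. It is not how ANTLR is implemented.
class FirstMatchWinsDemo {

    // Returns the prefix of input matched by regex, or null if nothing matches.
    static String matchedPrefix(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        // lookingAt() anchors at the start but does not require a full match,
        // similar to a lexer taking one token from the front of the input.
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Same shape as the XXX rule: 'z'*? (FOO | FOOBAR)
        System.out.println(matchedPrefix("z*?(foo|foobar)", "zfoobar")); // zfoo

        // Hypothetical reordering with the longer alternative first
        System.out.println(matchedPrefix("z*?(foobar|foo)", "zfoobar")); // zfoobar
    }
}
```

In this sketch the first call stops after zfoo and leaves bar behind, just like the lexer output described above. Whether reordering the alternatives has the same effect in the ANTLR grammar is an assumption that should be checked against the generated lexer.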
As for nongreedy parser subrules, the book says:
Nongreedy Parser Subrules
Nongreedy subrules and wildcards are also
useful within parsers to do “fuzzy parsing” where the goal is to
extract information from an input file without having to specify the
full grammar. In contrast to nongreedy lexer decision making, parsers
always make globally correct decisions. A parser never makes a
decision that will ultimately cause valid input to fail later during
the parse. Here is the central idea: nongreedy parser subrules match
the shortest sequence of tokens that preserves a successful parse for
a valid input sentence.
In other words, the parser does not impose an ordering on the alternatives of the subrule.