Although the conclusion is already in the comments, here is an answer that explains it in a bit more detail.
When you define literal tokens inside parser rules (the 'ba', 'a' and 'b'), ANTLR implicitly creates the following grammar:
grammar G ;
start : s EOF ;
s : T__0 a b ;
a : T__1 ;
b : T__2 ;
T__0 : 'ba';
T__1 : 'b';
T__2 : 'a';
Now, when the lexer gets the input "baba", it will create 2 T__0 tokens. The lexer is not driven by whatever the parser is trying to match; it works independently from the parser. The lexer creates tokens following these 2 rules:
- try to match as many characters as possible for a rule
- when 2 (or more) lexer rules match the same characters, let the one defined first "win"
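Because of rule 1, it is apparent that 2 T__0 tokens are created: at each "ba" in the input, the 2-character 'ba' literal beats the 1-character 'b' literal. The parser then receives T__0 T__0 EOF, while rule s expects T__0 followed by T__1 (rule a) and T__2 (rule b), so the parse fails.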
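To make the two lexer rules concrete, here is a minimal Python sketch that imitates them for plain literal tokens. It is not ANTLR's actual implementation: the `tokenize` helper and the `RULES` list are hypothetical names made up for this illustration, and only the token names and literals come from the implicit grammar above.

```python
# Hypothetical stand-in for the implicit lexer rules, in definition order.
RULES = [("T__0", "ba"), ("T__1", "b"), ("T__2", "a")]


def tokenize(text, rules):
    """Return token names using longest match, then first-defined-wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, literal in rules:
            # Only a strictly longer match replaces the current best, so on
            # a tie the rule that was defined first is kept.
            if text.startswith(literal, pos) and len(literal) > best_len:
                best_name, best_len = name, len(literal)
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(best_name)
        pos += best_len
    return tokens


print(tokenize("baba", RULES))  # ['T__0', 'T__0']
```

The sketch never consults any parser rule, which mirrors the point that the lexer runs independently of what the parser expects.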
As you already mentioned in a comment, removing the 'ba' token (and using 'b' followed by 'a') would resolve the issue:
grammar G ;
start : s EOF ;
s : 'b' 'a' a b ;
a : 'b' ;
b : 'a' ;
which would really be the grammar:
grammar G ;
start : s EOF ;
s : T__0 T__1 a b ;
a : T__0 ;
b : T__1 ;
T__0 : 'b';
T__1 : 'a';
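As a quick sanity check of this last grammar, the following Python sketch imitates the lexer's longest-match and first-defined-wins behaviour for the two remaining literals and compares the result with the token sequence that rule s expects. Again, this is only an illustration and not ANTLR code: `tokenize`, `FIXED_RULES` and `EXPECTED` are hypothetical names introduced here.

```python
# Hypothetical stand-in for the implicit lexer rules of the fixed grammar.
FIXED_RULES = [("T__0", "b"), ("T__1", "a")]


def tokenize(text, rules):
    """Return token names using longest match, then first-defined-wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, literal in rules:
            # Only a strictly longer match replaces the current best, so on
            # a tie the rule that was defined first is kept.
            if text.startswith(literal, pos) and len(literal) > best_len:
                best_name, best_len = name, len(literal)
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(best_name)
        pos += best_len
    return tokens


# s : T__0 T__1 a b, with a : T__0 and b : T__1
EXPECTED = ["T__0", "T__1", "T__0", "T__1"]

tokens = tokenize("baba", FIXED_RULES)
print(tokens, tokens == EXPECTED)  # ['T__0', 'T__1', 'T__0', 'T__1'] True
```

With no 'ba' literal left, the longest match at every position is a single character, so the token stream lines up with what the parser rules ask for.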