antlr lexer rule matching a prefix of another rule

Question

I am not sure that the issue is actually the prefixes, but here goes.

I have these two rules in my grammar (among many others)

DOT_T  : '.' ;
AND_T  : '.AND.'  | '.and.'  ;

and I need to parse strings like this:

a.eq.b.and.c.ne.d
c.append(b)

this should get lexed as:

ID[a] EQ_T ID[b] AND_T ID[c] NE_T ID[d]
ID[c] DOT_T ID[append] LPAREN_T ID[b] RPAREN_T

the error I get for the second line is:

line 1:3 mismatched character "p"; expecting "n"

It doesn't lex the '.' as a DOT_T but instead tries to match '.and.' because it sees the 'a' after the '.'.

Any idea on what I need to do to make this work?

UPDATE

I added the following rule and thought I'd use the same trick

NUMBER_T
    : DIGIT+
        ( (DECIMAL)=> DECIMAL 
        | (KIND)=>    KIND
        )?
    ;

fragment DECIMAL
    : '.' DIGIT+ ;
fragment KIND
    : '.' DIGIT+ '_' (ALPHA+ | DIGIT+) ;

but when I try parsing this:

lda.eq.3.and.dim.eq.3

it gives me the following error:

line 1:9 no viable alternative at character "a"

while lexing the 3. So I'm guessing the same thing is happening as above, but the solution doesn't work in this case :S Now I'm properly confused...

Bart Kiers · Accepted Answer · 2012-04-03T18:56:31.160

Yes, that is because of the prefixed '.'-s.

Whenever the lexer stumbles upon ".a", it tries to create an AND_T token. If the characters "nd" cannot then be found, the lexer looks for another token that starts with ".a". There is none, so ANTLR produces an error. In other words, the lexer will not give back the character "a" and fall back to creating a DOT_T token (followed by an ID token). This is how ANTLR works.

What you can do is optionally match these AND_T, EQ_T, ... inside the DOT_T rule. But still, you will need to "help" the lexer a bit by adding some syntactic predicates that force the lexer to look ahead in the character stream to be sure it can match these tokens.
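To make the look-ahead idea concrete outside ANTLR, here is a minimal Python sketch of what the predicates are meant to achieve: after seeing a '.', inspect the whole operator suffix before consuming any of it, and fall back to a plain dot otherwise. The function name lex_dot and the OPERATORS table are hypothetical and only model the behavior; this is not the code ANTLR generates.

```python
# Hypothetical hand-written model of the look-ahead; not ANTLR-generated code.
# Maps the text that must follow the leading '.' to the resulting token type.
OPERATORS = {"and.": "AND_T", "eq.": "EQ_T", "ne.": "NE_T"}


def lex_dot(text, pos):
    """Lex the token that starts at text[pos], which must be a '.'.

    Returns a tuple (token_type, lexeme, next_pos).
    """
    if text[pos] != ".":
        raise ValueError("lex_dot must be called on a '.'")
    rest = text[pos + 1:]
    for suffix, token_type in OPERATORS.items():
        # Check the complete suffix (lower-case or upper-case spelling)
        # before committing, which is the job of the (AND_T)=> predicates.
        if rest.startswith((suffix, suffix.upper())):
            end = pos + 1 + len(suffix)
            return token_type, text[pos:end], end
    # No operator follows, so only the '.' itself is consumed.
    return "DOT_T", ".", pos + 1
```

With this model, the '.' in "c.append(b)" comes back as DOT_T because "append" does not start with a complete "and.", while the '.' in "a.eq.b" is consumed together with "eq." as a single EQ_T.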

A demo:

grammar T;  

parse
 : (t=. {System.out.printf("\%-10s '\%s'\n", tokenNames[$t.type], $t.text);})* EOF
 ;

DOT_T  
 : '.' ( (AND_T)=> AND_T {$type=AND_T;}
       | (EQ_T)=>  EQ_T  {$type=EQ_T; }
       | (NE_T)=>  NE_T  {$type=NE_T; }
       )?
 ;

ID
 : ('a'..'z' | 'A'..'Z')+
 ;

LPAREN_T
 : '('
 ;

RPAREN_T
 : ')'
 ;

SPACE
 : (' ' | '\t' | '\r' | '\n')+ {skip();}
 ;

NUMBER_T
 : DIGIT+ ((DECIMAL)=> DECIMAL)?
 ;

fragment DECIMAL : '.' DIGIT+ ;
fragment AND_T   : ('AND' | 'and') '.' ;
fragment EQ_T    : ('EQ'  | 'eq' ) '.' ;
fragment NE_T    : ('NE'  | 'ne' ) '.' ;
fragment DIGIT   : '0'..'9';

And if you feed the generated parser the input:

a.eq.b.and.c.ne.d
c.append(b)

the following output will be printed:

ID         'a'
EQ_T       '.eq.'
ID         'b'
AND_T      '.and.'
ID         'c'
NE_T       '.ne.'
ID         'd'
ID         'c'
DOT_T      '.'
ID         'append'
LPAREN_T   '('
ID         'b'
RPAREN_T   ')'

And for the input:

lda.eq.3.and.dim.eq.3

the following is printed:

ID         'lda'
EQ_T       '.eq.'
NUMBER_T   '3'
AND_T      '.and.'
ID         'dim'
EQ_T       '.eq.'
NUMBER_T   '3'

EDIT

The fact that DECIMAL and KIND both start with '.' DIGIT+ is not good. Try something like this:

NUMBER_T
 : DIGIT+ ((DECIMAL)=> DECIMAL ((KIND)=> KIND)?)?
 ;

fragment DECIMAL : '.' DIGIT+;
fragment KIND    : '_' (ALPHA+ | DIGIT+); // removed ('.' DIGIT+) from this fragment

Note that the rule NUMBER_T will now never produce DECIMAL or KIND tokens. If you want that to happen, you need to change the type:

NUMBER_T
 : DIGIT+ ((DECIMAL)=> DECIMAL {/*change type*/} ((KIND)=> KIND {/*change type*/})?)?
 ;
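The nested optional structure of NUMBER_T can also be modeled in a few lines of Python. This is only a sketch under assumptions: the regular expressions are hypothetical stand-ins for the DIGIT, DECIMAL and KIND fragments, and ALPHA (not shown in the question) is assumed to mean ASCII letters.

```python
import re

# Hypothetical stand-ins for the fragments; ALPHA is assumed to be a-z / A-Z.
DIGITS = re.compile(r"[0-9]+")
DECIMAL = re.compile(r"\.[0-9]+")
KIND = re.compile(r"_(?:[A-Za-z]+|[0-9]+)")


def lex_number(text, pos):
    """Return (lexeme, next_pos) for the number starting at text[pos]."""
    digits = DIGITS.match(text, pos)
    if digits is None:
        raise ValueError(f"no digit at position {pos}")
    end = digits.end()
    # Consume the '.' only if a full DECIMAL follows (the look-ahead step).
    decimal = DECIMAL.match(text, end)
    if decimal:
        end = decimal.end()
        # A KIND suffix is only tried after a DECIMAL, as in the nested rule.
        kind = KIND.match(text, end)
        if kind:
            end = kind.end()
    return text[pos:end], end
```

In this model the '3' in "3.and.dim" stops before the dot, since ".and" is not a DECIMAL, which leaves the '.' free to start an AND_T.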

@Milan, `lda.eq.3.and.dim.eq.3` *is* properly parsed. See the edited demo in my answer. If it doesn't with you, you'll need to provide the exact grammar that fails (there must be something else going wrong). — Bart Kiers, Apr 03 '12 at 18:04
hm... I added another rule that actually broke it... I updated the example above... — Milan, Apr 03 '12 at 18:27
thanks! it worked :) you have a beer waiting for you in Zurich :) — Milan, Apr 03 '12 at 20:00
