Antlr4 doesn't recognize identifiers

Question

I'm trying to create a grammar which parses a file line by line.

grammar Comp;

options 
{
    language = Java;
}

@header {
    package analyseur;
    import java.util.*;
    import component.*;
}

@parser::members {
    /** Line to write in the new java file */
    public String line;
}

start   
        : objectRule        {System.out.println("OBJ");  line = $objectRule.text;}
        | anyString         {System.out.println("ANY");  line = $anyString.text;}
        ;

objectRule : ObjectKeyword ID ;

anyString : ANY_STRING ;


ObjectKeyword :  'Object' ;
ID  :   [a-zA-Z]+ ;
ANY_STRING :  (~'\n')+ ;
WhiteSpace : (' '|'\t') -> skip;

When I send the input 'Object o' to the grammar, the output is ANY instead of OBJ.

'Object o'   =>  'ANY'   // I would like OBJ

I know that ANY_STRING matches a longer string, but I listed the lexer rules in priority order. What is the problem?

Thank you very much for your help ! ;)

The lexer behavior is to match the longest string, as you've mentioned. The order rule doesn't matter if the length is different. — Mephy, Jul 14 '15 at 17:09
All of your questions on SO have been a bit vague. That is most likely why you haven't received any answers to them. Consider posting actual input you're trying to parse and explaining how exactly you want this input to be tokenized/parsed. — Bart Kiers, Jul 14 '15 at 19:14
Sorry, but I'm French, so it's possible that my English is bad. ^^' As I said in my first post, I would like the grammar to print OBJ for the input 'Object o', but ANY is printed instead. — Maluna34, Jul 14 '15 at 19:23

score 1 · Accepted Answer · edited May 23 '17 at 12:29

For lexer rules, the rule with the longest match wins, independent of rule ordering. If the match length is the same, then the first listed rule wins.

To make rule order meaningful, reduce the possible match length of the ANY_STRING rule so that it is never longer than the match of any keyword or ID rule:

ANY_STRING: ~( ' ' | '\n' | '\t' ) ; // also?: '\r' | '\f' | '_'
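
The effect of the two ANY_STRING variants can be imitated outside ANTLR. The following is only a minimal sketch: it uses java.util.regex to mimic "longest match wins, first listed rule wins a tie", and it is not how ANTLR itself is implemented. The class name MaximalMunchDemo and its methods are hypothetical, and the regular expressions are hand translations of the lexer rules in the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: imitates the matching rule, it is not ANTLR code.
public class MaximalMunchDemo {

    // Lexer rules from the question, kept in their listed order.
    static Map<String, Pattern> originalRules() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("ObjectKeyword", Pattern.compile("Object"));
        rules.put("ID", Pattern.compile("[a-zA-Z]+"));
        rules.put("ANY_STRING", Pattern.compile("[^\\n]+"));
        rules.put("WhiteSpace", Pattern.compile("[ \\t]"));
        return rules;
    }

    // Same rules, but ANY_STRING now matches exactly one non-blank character.
    static Map<String, Pattern> fixedRules() {
        Map<String, Pattern> rules = originalRules();
        rules.put("ANY_STRING", Pattern.compile("[^ \\n\\t]")); // order is kept
        return rules;
    }

    // Longest match wins; on equal length the rule listed first wins.
    static List<String> tokenize(Map<String, Pattern> rules, String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strict '>' means a later rule cannot steal a tie.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                throw new IllegalStateException("no rule matches at index " + pos);
            }
            if (!bestRule.equals("WhiteSpace")) { // imitates '-> skip'
                tokens.add(bestRule + "('" + input.substring(pos, bestEnd) + "')");
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [ANY_STRING('Object o')]
        System.out.println(tokenize(originalRules(), "Object o"));
        // prints [ObjectKeyword('Object'), ID('o')]
        System.out.println(tokenize(fixedRules(), "Object o"));
    }
}
```

With the original rules, ANY_STRING swallows the whole line because its match (8 characters) is longer than the 6 characters matched by ObjectKeyword. With the one-character variant, ObjectKeyword and ID tie on 'Object' and the rule listed first wins.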

Update

To see what the lexer is actually doing, dump the token stream.

answered Jul 14 '15 at 20:33

GRosenberg


Thank you. So I have to move the character repetition into the associated rule? – Maluna34 Jul 14 '15 at 22:24
Your question is not understood. Please try again. – GRosenberg Jul 14 '15 at 22:57
So I removed the '+' from the ANY_STRING token and put it in the anyString rule. The rest of the grammar is unchanged, so I now have: ANY_STRING: ~( ' ' | '\n' | '\t' ) ; anyString : ANY_STRING+ ; But now any other word is no longer recognized. For example, I get the following error: "line 2:0 no viable alternative at input 'public'". I think it is recognized as an ID instead of an anyString. Is there a solution? – Maluna34 Jul 15 '15 at 08:20
If the answer provided helped solve the problem identified in your original post, please accept the answer. If there is a follow-on problem, then post a new question with a **full** explanation of what you are trying to do and the difficulty encountered. – GRosenberg Jul 15 '15 at 18:08
