I've written the following grammar in ANTLR:

grammar PARVA;


prog : lexeme* ;


lexeme :TOK_STRLIT
       | TOK_INTLIT
       | TOK_CHARLIT
       | ID
       | LINE_COMMENT
       | COMMENT
       | TOK_ASSERT
       | TOK_BOOL
       | TOK_BOOLEAN
       | TOK_BREAK
       | TOK_CHAR
       | TOK_CIN
       | TOK_CONST
       | TOK_COUT
       | TOK_DO
       | TOK_ELSE
       | TOK_ENUM
       | TOK_EOF
       | TOK_EOLN
       | TOK_EXIT       
       | TOK_FALSE      
       | TOK_FOR        
       | TOK_GET       
       | TOK_IF        
       | TOK_INLINE   
       | TOK_INT      
       | TOK_MOD      
       | TOK_NEW       
       | TOK_PRINT      
       | TOK_PRINTLN    
       | TOK_RANDOM     
       | TOK_RANDOMSEED 
       | TOK_READ      
       | TOK_RETURN     
       | TOK_STACKDUMP  
       | TOK_TRUE      
       | TOK_VAL       
       | TOK_VOID       
       | TOK_WHILE      
       | TOK_WRITE      
       | TOK_TOUPPER    
       | TOK_TOLOWER   
       | TOK_OP_NOT    
       | TOK_BITOR     
       | TOK_OR       
       | TOK_BITAND   
       | TOK_AND    
       | TOK_OP_REL 
       | TOK_OP_ASSIGN 
       | TOK_OP_ADD 
       | TOK_OP_TIMES
       | TOK_INC
       | TOK_DEC    
       | TOK_COMMA  
       | TOK_COLON  
       | TOK_SEMI   
       | TOK_LSHIFT 
       | TOK_RSHIFT 
       | TOK_LB    
       | TOK_RB   
       | TOK_LC    
       | TOK_RC     
       | TOK_LP   
       | TOK_RP
       | WS
       ;  
Letter     : [a-zA-Z] ;
Digit      : [0-9] ;
Hex_Digit  : [a-fA-F0-9] ;
UNICODE    : 'u' Hex_Digit Hex_Digit Hex_Digit Hex_Digit ;
ESC        : '\\"'
           | '\\\\'
           ;

TOK_STRLIT  : '"' (ESC|.)*? '"' ;
TOK_INTLIT  : [0-9]+ ;
TOK_CHARLIT : '\\'('a' | 'b' | 'f' | 'n' | 'r' | 't' | UNICODE ) | '\'' Letter '\'' | '\'' Digit '\'' ;
ID          : Letter (Letter | Digit | '_' )* ;
WS          : [ \t\r\n]+ -> skip ;



LINE_COMMENT : '//' .*? '\n' -> skip ;
COMMENT      : '/*' .*? '*/' -> skip ;

TOK_ASSERT     : 'assert' ;
TOK_BOOL       : 'bool' ;
TOK_BOOLEAN    : 'boolean' ;
TOK_BREAK      : 'break' ; 
TOK_CHAR       : 'char' ;
TOK_CIN        : 'cin' ;
TOK_CONST      : 'const' ;
TOK_COUT       : 'cout' ;
TOK_DO         : 'do' ;
TOK_ELSE       : 'else' ;
TOK_ENUM       : 'enum' ;
TOK_EOF        : 'eof' ;
TOK_EOLN       : 'eoln' ;
TOK_EXIT       : 'exit' ;
TOK_FALSE      : 'false' ;
TOK_FOR        : 'for' ;
TOK_GET        : 'get' ;
TOK_IF         : 'if' ;
TOK_INLINE     : 'inline' ;
TOK_INT        : 'int' ;
TOK_MOD        : 'mod' ;
TOK_NEW        : 'new' ;
TOK_PRINT      : 'print' ;
TOK_PRINTLN    : 'println' ;
TOK_RANDOM     : 'random' ;
TOK_RANDOMSEED : 'randomseed' ;
TOK_READ       : 'read' ;
TOK_RETURN     : 'return' ;
TOK_STACKDUMP  : 'stackdump' ;
TOK_TRUE       : 'true' ;
TOK_VAL        : 'val' ;
TOK_VOID       : 'void' ;
TOK_WHILE      : 'while' ;
TOK_WRITE      : 'write' ;
TOK_TOUPPER    : 'toUpperCase' ;
TOK_TOLOWER    : 'toLowerCase' ;



TOK_OP_NOT    : '!' ;
TOK_BITOR     : '|' ;
TOK_OR        : '||' ;
TOK_BITAND    : '&' ;
TOK_AND       : '&&' ;
TOK_OP_REL    : '==' 
              | '!='
              | '<'
              | '<='
              | '>'
              | '>=' 
              ;
TOK_OP_ASSIGN : '='
              | '%='
              | '&='
              | '|='
              | '*='
              | '+='
              | '-='
              | '/='
              ;
TOK_OP_ADD    : '+'
              | '-'
              ;
TOK_OP_TIMES  : '*'
              | '/'
              | '%'
              ;
TOK_DEC       : '--' ;
TOK_INC       : '++' ;


TOK_COMMA  : ',' ;
TOK_COLON  : ':' ;
TOK_SEMI   : ';' ;
TOK_LSHIFT : '<<' ;
TOK_RSHIFT : '>>' ;
TOK_LB     : '[' ;
TOK_RB     : ']' ;
TOK_LC     : '{' ;
TOK_RC     : '}' ;
TOK_LP     : '(' ;
TOK_RP     : ')' ;

But when I give it the following input:

int main(){
   int a;
}

I get the following error:

extraneous input 'a' expecting {<EOF>, TOK_STRLIT, TOK_INTLIT, TOK_CHARLIT, ID, WS, LINE_COMMENT, COMMENT, 'assert', 'bool', 'boolean', 'break', 'char',  'cin', 'const', 'cout', 'do', 'else', 'enum', 'eof', 'eoln', 'exit', 'false',  'for', 'get', 'if', 'inline', 'int', 'mod', 'new', 'print', 'println', 'random', 'randomseed', 'read', 'return', 'stackdump', 'true', 'val', 'void', 'while', 'write', 'toUpperCase', 'toLowerCase', '!', '|', '||', '&', '&&', TOK_OP_REL, TOK_OP_ASSIGN, TOK_OP_ADD, TOK_OP_TIMES, '--', '++', ',', ':', ';', '<<', '>>', '[', ']', '{', '}', '(', ')'}

This is really frustrating. I've been trying for hours and can't figure out what I did wrong. I'm very new to ANTLR; what could be the problem?

  • According to http://stackoverflow.com/a/23639737/1980909, the tokeniser will match the longest string, and if there are multiple of the same length, it will match the first. Hex_Digit will match "a" and comes before ID in your definition. – Adrian Leonhard Feb 20 '15 at 17:49
  • possible duplicate of [Antlr Extraneous Input](http://stackoverflow.com/questions/23621660/antlr-extraneous-input) – Adrian Leonhard Feb 20 '15 at 17:49
  • @AdrianLeonhard: while this is a priority problem, surely the better ANTLR4 solution is to declare the non-token lexical descriptions as `fragment` so that they don't participate in the lexical match? (That's not in the nominated duplicate because it doesn't really apply to that case.) – rici Feb 20 '15 at 19:59

1 Answer

As mentioned in the comments (and in the nominated duplicate question), the problem is that `a` matches `Letter`, whereas you want it to match `ID`. Both rules match the same one-character input, and when several rules match the same longest input, the one defined first wins; `Letter` is defined earlier in the grammar than `ID`. So you could fix it by rearranging the definitions.

You'd need to move the definition of `Hex_Digit` as well. And then you'd find that `UNICODE` still matches some identifiers: any name consisting of a `u` followed by exactly four hex digits, such as `uface`.
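The tie-breaking described above can be sketched in a few lines of Python. This is only a rough simulation, not ANTLR itself: the regexes are simplified stand-ins for a handful of the question's rules, and `next_token` is a hypothetical helper that picks the longest match and, on a tie, the rule listed first.

```python
import re

# Simplified stand-ins for a few of the question's lexer rules, in grammar order.
# This mimics only the "longest match, then first rule" policy (an illustration,
# not ANTLR's real implementation).
RULES = [
    ("Letter", r"[a-zA-Z]"),
    ("Digit", r"[0-9]"),
    ("Hex_Digit", r"[a-fA-F0-9]"),
    ("UNICODE", r"u[a-fA-F0-9]{4}"),
    ("ID", r"[a-zA-Z][a-zA-Z0-9_]*"),
]

def next_token(text, rules):
    """Return (rule_name, lexeme) for the longest match at the start of text.

    Ties go to the rule that appears first in `rules`.
    """
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Strictly-greater comparison keeps the earliest rule on a tie.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("a", RULES))     # ('Letter', 'a')
print(next_token("main", RULES))  # ('ID', 'main')

# Move Letter and Hex_Digit after ID, leaving UNICODE where it was.
by_name = dict(RULES)
reordered = [(n, by_name[n]) for n in ("Digit", "UNICODE", "ID", "Letter", "Hex_Digit")]

print(next_token("a", reordered))      # ('ID', 'a')
print(next_token("uface", reordered))  # ('UNICODE', 'uface')
```

Reordering fixes `a`, but `uface` is still claimed by `UNICODE`, which is why reordering alone is a fragile fix.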

But I think you never want a token to match `Letter`, `Digit`, `Hex_Digit`, `UNICODE` or `ESC`. These are only intended to be named fragments which appear in other lexical rules, not tokens in their own right. (Personally, I'm not a big fan of this style, particularly for simple fragments like these, but everyone has their own style.) In that case, you should declare them explicitly as `fragment` so that they will not be matched as tokens:

fragment Letter     : [a-zA-Z] ;
fragment Digit      : [0-9] ;
fragment Hex_Digit  : [a-fA-F0-9] ;
...

and then it doesn't matter where you put them in the grammar.
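As a rough illustration of why fragments sidestep the ordering problem, here is a small Python sketch. It is not ANTLR; the regexes are simplified stand-ins for a few of the question's rules, and the `is_fragment` flag and `next_token` helper are hypothetical names used only for this example. Rules flagged as fragments are skipped when choosing a token, so only real token rules compete.

```python
import re

# (name, regex, is_fragment), in the same relative order as the question's grammar.
# The fragment flag is this sketch's stand-in for ANTLR's `fragment` keyword.
RULES = [
    ("Letter", r"[a-zA-Z]", True),
    ("Digit", r"[0-9]", True),
    ("Hex_Digit", r"[a-fA-F0-9]", True),
    ("TOK_INTLIT", r"[0-9]+", False),
    ("ID", r"[a-zA-Z][a-zA-Z0-9_]*", False),
]

def next_token(text, rules):
    """Longest match wins, ties go to the first rule; fragments never compete."""
    best = None
    for name, pattern, is_fragment in rules:
        if is_fragment:
            continue  # fragments are building blocks, not tokens
        m = re.match(pattern, text)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("a", RULES))  # ('ID', 'a')
print(next_token("7", RULES))  # ('TOK_INTLIT', '7')
```

Even though `Letter` and `Digit` are still listed first, `a` now comes out as `ID` and `7` as `TOK_INTLIT`.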

See https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lexer+Rules

rici