ANTLR4 g4 grammar to read key/value pair in different blocks

Question

I'm new to ANTLR. I'm trying to write a simple grammar, but I can't get it to work. I would like to parse this kind of file:

BEGIN HEADER
    CharacterSet "CP1252"
END HEADER
BEGIN DSJOB
    test "val"
END DSJOB
BEGIN DSJOB
    test "val2"
END DSJOB

I'm using this grammar:

grammar Hello;
dsxFile             :   headerDeclaration? jobDeclaration* EOF;
headerDeclaration   :   'BEGIN HEADER' param* 'END HEADER';
jobDeclaration      :   'BEGIN DSJOB' subJobDeclaration* param* 'END DSJOB';
subJobDeclaration       :   'BEGIN DSSUBJOB' param* 'END DSSUBJOB';

headParam
    :   (   'CharacterSet'     
        |   'name'  
        ) StringLiteral
    ;

// ANNOTATIONS
param   :   PNAME PVALUE;

PNAME :StringCharacters;
PVALUE :StringCharacters;
// STATEMENTS / BLOCKS
//block
//    :   '{' blockStatement* '}';

// LEXER

// Keywords

ABSTRACT      : 'abstract';
ASSERT        : 'assert';
BOOLEAN       : 'boolean';
BREAK         : 'break';
BYTE          : 'byte';
CASE          : 'case';
CATCH         : 'catch';
CHAR          : 'char';
CLASS         : 'class';
CONST         : 'const';
CONTINUE      : 'continue';
DEFAULT       : 'default';
DO            : 'do';
DOUBLE        : 'double';
ELSE          : 'else';
ENUM          : 'enum';
EXTENDS       : 'extends';
FINAL         : 'final';
FINALLY       : 'finally';
FLOAT         : 'float';
FOR           : 'for';
IF            : 'if';
GOTO          : 'goto';
IMPLEMENTS    : 'implements';
IMPORT        : 'import';
INSTANCEOF    : 'instanceof';
INT           : 'int';
INTERFACE     : 'interface';
LONG          : 'long';
NATIVE        : 'native';
NEW           : 'new';
PACKAGE       : 'package';
PRIVATE       : 'private';
PROTECTED     : 'protected';
PUBLIC        : 'public';
RETURN        : 'return';
SHORT         : 'short';
STATIC        : 'static';
STRICTFP      : 'strictfp';
SUPER         : 'super';
SWITCH        : 'switch';
SYNCHRONIZED  : 'synchronized';
THIS          : 'this';
THROW         : 'throw';
THROWS        : 'throws';
TRANSIENT     : 'transient';
TRY           : 'try';
VOID          : 'void';
VOLATILE      : 'volatile';
WHILE         : 'while';

//  Boolean Literals

BooleanLiteral  :   'true' |   'false';

//  Character Literals


fragment
SingleCharacter
    :   ~['\\] ;
//  String Literals
StringLiteral
    :   '"' StringCharacters? '"'
    ;
fragment
StringCharacters
    :   StringCharacter+
    ;
fragment
StringCharacter
    :   ~["\\]
    ;


// Separators

LPAREN          : '(';
RPAREN          : ')';
LBRACE          : '{';
RBRACE          : '}';
LBRACK          : '[';
RBRACK          : ']';
SEMI            : ';';
COMMA           : ',';
DOT             : '.';

// Operators

ASSIGN          : '=';
GT              : '>';
LT              : '<';
BANG            : '!';
TILDE           : '~';
QUESTION        : '?';
COLON           : ':';
EQUAL           : '==';
LE              : '<=';
GE              : '>=';
NOTEQUAL        : '!=';
AND             : '&&';
OR              : '||';
INC             : '++';
DEC             : '--';
ADD             : '+';
SUB             : '-';
MUL             : '*';
DIV             : '/';
BITAND          : '&';
BITOR           : '|';
CARET           : '^';
MOD             : '%';

ADD_ASSIGN      : '+=';
SUB_ASSIGN      : '-=';
MUL_ASSIGN      : '*=';
DIV_ASSIGN      : '/=';
AND_ASSIGN      : '&=';
OR_ASSIGN       : '|=';
XOR_ASSIGN      : '^=';
MOD_ASSIGN      : '%=';
LSHIFT_ASSIGN   : '<<=';
RSHIFT_ASSIGN   : '>>=';
URSHIFT_ASSIGN  : '>>>=';


//
// Additional symbols not defined in the lexical specification
//

AT : '@';
ELLIPSIS : '...';

//
// Whitespace and comments
//

WS  :  [ \t\r\n\u000C]+ -> skip
    ;

COMMENT
    :   '/*' .*? '*/' -> skip
    ;

LINE_COMMENT
    :   '//' ~[\r\n]* -> skip
;

But I'm still getting this error:

line 1:0 mismatched input 'BEGIN HEADER\r\n\tCharacterSet ' expecting {<EOF>, 'BEGIN HEADER', 'BEGIN DSJOB'} (dsxFile BEGIN HEADER\r\n\tCharacterSet "CP1252" \r\nEND HEADER\r\nBEGIN DSJOB\r\n\ttest "val" \r\nEND DSJOB)

Can someone explain what this means? It seems the lexer can't skip \r\t.

Thanks for your help!

Please post a **complete, working grammar** that produces the error you mentioned. — Bart Kiers, Feb 09 '18 at 16:40

Bart Kiers · Accepted Answer · 2018-02-09T17:32:56.723

The problem is that your input is not tokenised as you expect, because the lexer matches as much input as possible. Look at the PNAME rule and the StringCharacter fragment it is built on:

PNAME : StringCharacters;

fragment StringCharacter
 : ~["\\]
 ;

You will notice that the input "BEGIN HEADER\r\n\tCharacterSet " (everything up to the first double quote) matches that rule.

This is what the error message

mismatched input 'BEGIN HEADER\r\n\tCharacterSet ' expecting {<EOF>, 'BEGIN HEADER', 'BEGIN DSJOB'}

is telling you: the single token 'BEGIN HEADER\r\n\tCharacterSet ' was found, while the parser expected one of the tokens 'BEGIN HEADER' or 'BEGIN DSJOB' (or the end of the file).

You will probably need to exclude spaces, tabs and line breaks in that negated class as well: ~["\\ \t\r\n] (but that is for you to decide).
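
To see the difference between the two character classes outside ANTLR, here is a minimal Python sketch. It assumes Python's re module as a stand-in for the classes (this is not ANTLR syntax or ANTLR's API), and the sample string is just the beginning of the file from the question.

```python
import re

# Start of the sample input from the question, with Windows line endings.
text = 'BEGIN HEADER\r\n\tCharacterSet "CP1252"\r\nEND HEADER\r\n'

# Regex stand-ins for the two negated character classes (an assumption
# made for illustration; ANTLR compiles its rules differently).
original = re.compile(r'[^"\\]+')            # like ~["\\]+
restricted = re.compile(r'[^"\\ \t\r\n]+')   # like ~["\\ \t\r\n]+

# Both patterns are greedy, like the lexer: they take as much as they can.
first_original = original.match(text).group()
first_restricted = restricted.match(text).group()

print(repr(first_original))    # 'BEGIN HEADER\r\n\tCharacterSet '
print(repr(first_restricted))  # 'BEGIN'
```

The first result is exactly the token quoted in the error message; once whitespace is excluded, the match stops at the first space.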

Also, the lexer operates independently from the parser (the parser has no influence on which tokens are produced). The lexer simply tries to match as many characters as possible, and whenever two (or more) rules match the same characters, the rule defined first "wins". Given this logic, from the following rules:

PNAME : StringCharacters;
PVALUE : StringCharacters;

it is apparent that the rule PVALUE will never be matched (only PNAME, since that one is defined first).
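
The "longest match, then first rule wins" behaviour can be imitated in a few lines of Python. This is only a sketch under stated assumptions: regular expressions stand in for the two lexer rules, and winning_rule is a hypothetical helper written for this illustration, not part of ANTLR.

```python
import re

# Two rules with identical patterns, in the order they appear in the grammar.
RULES = [
    ("PNAME", r'[^"\\]+'),
    ("PVALUE", r'[^"\\]+'),
]

def winning_rule(text):
    """Name of the rule with the longest match at the start of text.

    The strict '>' comparison means a later rule can only win by matching
    MORE characters, so on a tie the rule defined first is kept.
    """
    best_name, best_len = None, 0
    for name, pattern in RULES:
        m = re.match(pattern, text)
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    return best_name

winners = {winning_rule(s) for s in ["test", "val", "CharacterSet"]}
print(winners)  # {'PNAME'}
```

Because both patterns always match the same number of characters, PVALUE can never be chosen, whatever the input is.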

Here's how you could parse your example input:

grammar Hello;

dsxFile            : headerDeclaration? jobDeclaration* EOF;
headerDeclaration  : BEGIN HEADER param* END HEADER;
jobDeclaration     : BEGIN DSJOB subJobDeclaration* param* END DSJOB;
subJobDeclaration  : BEGIN DSSUBJOB param* END DSSUBJOB;
param              : PNAME pvalue;
pvalue             : STRING /* other alternatives here? */;

STRING       : '"' ~["\r\n]* '"';
BEGIN        : 'BEGIN';
END          : 'END';
HEADER       : 'HEADER';
DSJOB        : 'DSJOB';
DSSUBJOB     : 'DSSUBJOB';

WS           : [ \t\r\n\u000C]+ -> skip;
COMMENT      : '/*' .*? '*/'    -> skip;
LINE_COMMENT : '//' ~[\r\n]*    -> skip;

// Be sure to put this rule _after_ the rules BEGIN, END, HEADER, ...
// otherwise this rule will match those keywords instead
PNAME        : ~["\\ \t\r\n]+;

Of course you'll need to change it to suit your needs exactly, but it's a start.
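
To get a feel for which tokens these lexer rules produce on the sample input, the two lexer principles described earlier (longest match, first-defined rule on a tie) can be simulated in Python. This is a rough sketch, not ANTLR itself: the regular expressions are hand-written approximations of the rules above, and tokenize is a hypothetical helper.

```python
import re

# Regex approximations of the lexer rules, in the same order as the grammar.
RULES = [
    ("STRING", r'"[^"\r\n]*"'),
    ("BEGIN", r"BEGIN"),
    ("END", r"END"),
    ("HEADER", r"HEADER"),
    ("DSJOB", r"DSJOB"),
    ("DSSUBJOB", r"DSSUBJOB"),
    ("WS", r"[ \t\r\n\f]+"),
    ("COMMENT", r"/\*.*?\*/"),
    ("LINE_COMMENT", r"//[^\r\n]*"),
    ("PNAME", r'[^"\\ \t\r\n]+'),   # last, so keywords win ties against it
]
SKIPPED = {"WS", "COMMENT", "LINE_COMMENT"}  # rules marked "-> skip"
COMPILED = [(name, re.compile(p, re.DOTALL)) for name, p in RULES]

def tokenize(text):
    """Return (rule_name, lexeme) pairs using longest match, first rule on ties."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in COMPILED:
            m = pattern.match(text, pos)
            if m and m.end() > best_end:  # strictly longer, so earlier rules win ties
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name not in SKIPPED:
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

sample = ('BEGIN HEADER\r\n\tCharacterSet "CP1252"\r\nEND HEADER\r\n'
          'BEGIN DSJOB\r\n\ttest "val"\r\nEND DSJOB\r\n')
names = [name for name, _ in tokenize(sample)]
print(names)
# ['BEGIN', 'HEADER', 'PNAME', 'STRING', 'END', 'HEADER',
#  'BEGIN', 'DSJOB', 'PNAME', 'STRING', 'END', 'DSJOB']
```

If PNAME were moved to the top of RULES, every keyword would tie with it and come out as PNAME instead, which is the reason for the ordering comment in the grammar.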

Wow, thank you so much! Everything becomes clear when explained like that. I'll try to apply this. — Damien F, Feb 12 '18 at 13:49
