Handling blank lines when White Space is important in ANTLR4

Question

This may be a newbie question, since I don't have much ANTLR experience, but I've done a lot of research and troubleshooting without finding a solution, so I'm resorting to asking. I am trying to write a parser for a very odd file format (from PCGEN, the open source role-playing game character editor) that I plan to use for several purposes, not the least of which is learning ANTLR. I have everything I want working in the lexer and parser, except that parsing stops when it hits a blank line. I know I could add a rule to throw away all whitespace, but in this format strings are not really quoted and whitespace is usually significant, so the only whitespace that should be ignored is a totally blank line. When I run the lexer on its own it produces tokens for the entire file, so I assumed the parser would process those tokens without concern for where they came from; I must be missing something simple. Here is the beginning of my input:

PCGVERSION:2.0

# System Information
CAMPAIGN:Advanced Player's Guide|CAMPAIGN:Ultimate Magic|CAMPAIGN:Ultimate Combat
VERSION:6.07.05
ROLLMETHOD:3|EXPRESSION:2d6+6
PURCHASEPOINTS:N

And this is my current grammar:

grammar PCG;

pcgFile     :   lines=line+;

line        :   statement (NEWLINE | EOF)
            ;

statement   :   KEYWORD ASSIGN
            |   KEYWORD ASSIGN YES_NO
            |   KEYWORD ASSIGN TEXT
            |   KEYWORD ASSIGN VERSIONNUM
            |   KEYWORD ( ASSIGN INT )+
            |   KEYWORD ASSIGN INT
            |   KEYWORD ASSIGN SUB_START statement SUB_END
            |   statement SEP statement
            ;


NEWLINE         :       '\r\n' | '\r' | '\n' ;
YES_NO          :       ('Y'|'N');
KEYWORD         :       [A-Z]+; 
INT             :       [0-9]+; 
TEXT            :       ~(':'|'|'|'\r'|'\n'|'['|']')+; 
ASSIGN          :       ':'; 
SEP             :       '|';

COMMENT         :       '#' ~[\r\n]*->skip ; 
VERSIONNUM      :       ([0-9]+ ('.' [0-9]+)?)
                |       ('.' [0-9]+)
                |       ([0-9]+ ('.' [0-9]+) ('.' [0-9]+)?)
                ; 

ROLL            :       INT [dD] INT (('+'|'-') INT)?;

SUB_START       :       '['; 
SUB_END         :       ']';

Any help would be appreciated.

Bart Kiers · Answer 1 · 2018-02-06T22:19:50.737

You need to allow for more than 1 new line between statements. Do that by removing the line rule and rewriting pcgFile like this:

pcgFile : NEWLINE* statement ( NEWLINE+ statement )* NEWLINE* EOF;
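
To see why the original line rule stops at a blank line, here is a minimal Python sketch (not ANTLR itself). It assumes every statement has already been reduced to the single symbol S and every NEWLINE token to N, then checks both rule shapes as regular expressions over that symbol string. The names OLD_RULE, NEW_RULE, accepts and the one-letter encoding are made up for this illustration.

```python
import re

# Hypothetical one-letter encoding: S = one statement, N = one NEWLINE token.
# Original shape:  pcgFile : line+ ;   line : statement (NEWLINE | EOF) ;
OLD_RULE = re.compile(r"(?:S(?:N|$))+")
# Rewritten shape: NEWLINE* statement ( NEWLINE+ statement )* NEWLINE* EOF
NEW_RULE = re.compile(r"N*S(?:N+S)*N*")

def accepts(rule, symbols):
    """Return True if the whole symbol string matches the rule."""
    return rule.fullmatch(symbols) is not None

# The sample input from the question: one statement, a blank line, a skipped
# comment (only its NEWLINE survives), then four more statements.
sample = "SN" + "N" + "N" + "SN" * 4

print(accepts(OLD_RULE, "SNSN"))   # True: no blank lines, the old rule copes
print(accepts(OLD_RULE, sample))   # False: a second N follows the first statement
print(accepts(NEW_RULE, sample))   # True: NEWLINE+ absorbs the extra newlines
```

The old shape fails as soon as two NEWLINE tokens appear in a row, because each line must start with a statement; the rewritten shape allows any number of them between, before and after statements.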

The main problem, though, is that your lexer tokenises # System Information as a TEXT token rather than a COMMENT. Whenever 2 or more rules match the same number of characters, the rule defined first "wins" ^*, and in your grammar that is TEXT. When you place COMMENT before TEXT, it will work:

grammar PCG;

pcgFile     :   NEWLINE* statement ( NEWLINE+ statement )* NEWLINE* EOF;

statement   :   KEYWORD ASSIGN
            |   KEYWORD ASSIGN YES_NO
            |   KEYWORD ASSIGN TEXT
            |   KEYWORD ASSIGN VERSIONNUM
            |   KEYWORD ( ASSIGN INT )+
            |   KEYWORD ASSIGN INT
            |   KEYWORD ASSIGN SUB_START statement SUB_END
            |   statement SEP statement
            ;

NEWLINE         :       '\r\n' | '\r' | '\n' ;
YES_NO          :       ('Y'|'N');
KEYWORD         :       [A-Z]+;
INT             :       [0-9]+;
COMMENT         :       '#' ~[\r\n]* ->skip ;
TEXT            :       ~(':'|'|'|'\r'|'\n'|'['|']')+;
ASSIGN          :       ':';
SEP             :       '|';

VERSIONNUM      :       ([0-9]+ ('.' [0-9]+)?)
                |       ('.' [0-9]+)
                |       ([0-9]+ ('.' [0-9]+) ('.' [0-9]+)?)
                ;

ROLL            :       INT [dD] INT (('+'|'-') INT)?;

SUB_START       :       '[';
SUB_END         :       ']';

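Keep in mind that ~(':'|'|'|'\r'|'\n'|'['|']')+ is dangerous: it is greedy and will swallow any run of characters up to the next ':', '|', '[', ']' or line break, including input you meant for other tokens.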
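
The tie-breaking behaviour described above (longest match wins, and on a tie the rule listed first wins) can be imitated in a few lines of Python. This is only a rough sketch: the regular expressions are simplified stand-ins for three of the lexer rules, and first_token, original_order and fixed_order are names invented for the example, not part of ANTLR.

```python
import re

def first_token(rules, text):
    """Return the name of the rule that wins at the start of text.

    The longest match wins; on a tie, the rule listed first wins.
    """
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Strictly greater: a later rule of equal length never replaces an earlier one.
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    return best_name

# Simplified regex stand-ins for three of the lexer rules.
KEYWORD = ("KEYWORD", r"[A-Z]+")
TEXT = ("TEXT", r"[^:|\r\n\[\]]+")
COMMENT = ("COMMENT", r"#[^\r\n]*")

original_order = [KEYWORD, TEXT, COMMENT]   # TEXT listed before COMMENT
fixed_order = [KEYWORD, COMMENT, TEXT]      # COMMENT listed before TEXT

line = "# System Information"
print(first_token(original_order, line))  # TEXT
print(first_token(fixed_order, line))     # COMMENT
```

Both TEXT and COMMENT match the whole comment line, so only their relative order decides which token is produced.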

^* Because the lexer works like this, input like 12 will never be tokenised as a VERSIONNUM token, since INT matches it too and occurs before VERSIONNUM. Fix it by doing something like this:

statement   :   ...
            |   KEYWORD ASSIGN versionnum
            |   ...
            ;

versionnum  : VERSIONNUM 
            | INT
            ;

...

INT             :       [0-9]+;

...

VERSIONNUM      :       [0-9]* '.' [0-9]+ ('.' [0-9]+)?
                ;

...
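
As a rough check of the revised rules, the Python sketch below translates INT and the new VERSIONNUM into regular expressions and classifies a few whole strings. It deliberately leaves out TEXT and every other rule, so it only illustrates how INT and VERSIONNUM relate to each other; token_type and is_versionnum_rule are hypothetical helper names.

```python
import re

# Regex translations of the two revised lexer rules.
INT = re.compile(r"[0-9]+")
VERSIONNUM = re.compile(r"[0-9]*\.[0-9]+(?:\.[0-9]+)?")

def token_type(text):
    """Classify a whole string as INT or VERSIONNUM, or None if neither fits."""
    if INT.fullmatch(text):
        return "INT"
    if VERSIONNUM.fullmatch(text):
        return "VERSIONNUM"
    return None

def is_versionnum_rule(text):
    """Stand-in for the parser rule  versionnum : VERSIONNUM | INT ;"""
    return token_type(text) in ("VERSIONNUM", "INT")

for sample in ["12", "2.0", "6.07.05", ".5"]:
    print(sample, token_type(sample), is_versionnum_rule(sample))
```

Because the revised VERSIONNUM always requires a '.', it no longer overlaps with INT, and the versionnum parser rule is what lets a plain integer such as 12 still count as a version number.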

Worked like a champ. I knew it would be a simple thing I should have thought of. As for the other points, the very greedy TEXT is unavoidable because of the undelimited way the text fields are written in this file, and I need to distinguish between versionnum and int. - Joe Bryant, Feb 06 '18 at 23:22
