ANTLR4: How to parse a WKT polygon string?

Question

I'm using ANTLR4 in Java, and I can parse a WKT polygon string like this

polygon((20 30, 30 40, 50 60, 20 30))

with this Lexer:

POLYGON: ('polygon'|'POLYGON')'(('[0-9:,-.eTZ" ]+'))';

The values inside polygon(( )) can be date-times or floats, so the character class has to allow some extra characters.

However, I couldn't parse a polygon with inner polygons like this

polygon((20 30, 30 40, 50 60, 20 30), (20 30, 30 40, 50 60, 20 30), (20 30, 30 40, 50 60, 20 30))

When I tried to add ( and ) to the lexer rule, e.g.:

POLYGON: ('polygon'|'POLYGON')'(('[0-9:,-.eTZ" \(\)]+'))';

Java throws an exception: cannot find ")" with .

What can I do to make ANTLR4 parse polygon((), (), (), ...)?

score 2 · Answer 1 · answered Jan 25 '18 at 11:48

I think that you shouldn't do it with just a lexer. You should use your lexer to split the input into symbols; e.g. 'polygon', '(', ')', ',', <number>, <date> and so on. Then implement a grammar to deal with the large-scale syntax; e.g.

<polygon> ::= 'polygon' '(' <list> ')'

<list> ::= '(' ')' |
           '(' <element> ( ',' <element> ) * ')'

<element> ::= <number> | <date>

(The meta-syntax I'm using is sort of EBNF ....)

The problems with using just a regex-based lexer with no grammar are:

- the regex is hard to read / hard to validate
- you don't get a parse tree
- you don't get any meaningful parser errors
- the more complicated the regex is, the more likely you are to run into performance issues; e.g. https://www.regular-expressions.info/catastrophic.html
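
To make the lexer/grammar split concrete, here is a minimal hand-written sketch in plain Java. It is not ANTLR output and not necessarily how you would structure the real thing. The class name WktSketch, the loose value pattern, and the ring / coordinate-pair shape (taken from the sample input in the question) are all assumptions made for illustration. The tokenizer only splits the text into symbols, and the nesting is handled by one method per grammar rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a hand-written lexer + recursive-descent parser, not ANTLR output.
final class WktSketch {
    // Lexer: keyword | punctuation | value. A value is kept as raw text and only
    // loosely matched (digits plus the float/date-time characters from the question).
    private static final Pattern TOKEN = Pattern.compile(
            "\\s*(?:(?i:(polygon))|([(),])|(-?[0-9][0-9:.TZeE+-]*))");

    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    private WktSketch(String input) {
        String text = input.trim();
        Matcher m = TOKEN.matcher(text);
        int end = 0;
        while (end < text.length()) {
            m.region(end, text.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character at offset " + end);
            }
            if (m.group(1) != null) tokens.add("polygon");       // keyword, normalized
            else if (m.group(2) != null) tokens.add(m.group(2)); // ( ) ,
            else tokens.add(m.group(3));                         // number or date-time
            end = m.end();
        }
    }

    // Entry point: returns rings, each ring a list of {x, y} pairs as raw strings.
    static List<List<String[]>> parse(String input) {
        WktSketch p = new WktSketch(input);
        List<List<String[]>> rings = p.polygon();
        if (p.pos != p.tokens.size()) {
            throw new IllegalArgumentException("Trailing input at token " + p.pos);
        }
        return rings;
    }

    // polygon : 'polygon' '(' ( ring ( ',' ring )* )? ')'
    private List<List<String[]>> polygon() {
        expect("polygon");
        expect("(");
        List<List<String[]>> rings = new ArrayList<>();
        if (!peek().equals(")")) {
            rings.add(ring());
            while (peek().equals(",")) { pos++; rings.add(ring()); }
        }
        expect(")");
        return rings;
    }

    // ring : '(' ( value value ( ',' value value )* )? ')'
    private List<String[]> ring() {
        expect("(");
        List<String[]> points = new ArrayList<>();
        if (!peek().equals(")")) {
            points.add(new String[] { value(), value() });
            while (peek().equals(",")) { pos++; points.add(new String[] { value(), value() }); }
        }
        expect(")");
        return points;
    }

    // value : any token that starts with a digit or a minus sign
    private String value() {
        String t = peek();
        char c = t.charAt(0);
        if (!Character.isDigit(c) && c != '-') {
            throw new IllegalArgumentException("Expected a value but found " + t + " at token " + pos);
        }
        pos++;
        return t;
    }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : "<EOF>";
    }

    private void expect(String symbol) {
        if (!peek().equals(symbol)) {
            throw new IllegalArgumentException(
                    "Expected " + symbol + " but found " + peek() + " at token " + pos);
        }
        pos++;
    }

    public static void main(String[] args) {
        List<List<String[]>> rings =
                parse("polygon((20 30, 30 40, 50 60, 20 30), (20 30, 30 40, 50 60, 20 30))");
        System.out.println(rings.size() + " rings, first ring has " + rings.get(0).size() + " points");
    }
}
```

Because the parentheses are separate tokens here, any number of inner rings is handled by the loop in polygon(), and a malformed input such as polygon((20 30, 30) fails with a message naming the token that was expected instead of a failed regex match.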

score 1 · Accepted Answer · edited Jan 25 '18 at 13:26

The lexer should only define the basic building blocks of the language. Polygon, list, etc. should be defined as parser rules.

Something like this should get you started:

grammar WKT;

parse
 : polygon EOF
 ;

polygon
 : POLYGON '(' ( points ( ',' points )* )? ')'
 ;

points
 : '(' ( value value ( ',' value value )* )? ')'
 ;

value
 : INT
 | FLOAT
 | DATE_TIME
 ;

POLYGON
 : [pP] [oO] [lL] [yY] [gG] [oO] [nN]
 ;

INT
 : DIGITS
 ;

FLOAT
 : DIGITS '.' DIGITS
 ;

DATE_TIME
 : D D D D '-' D D '-' D D 'T' D D ':' D D ':' D D [+-] D D ':' D D
 | D D D D '-' D D '-' D D 'T' D D ':' D D ':' D D 'Z'
 | D D D D D D D D 'T' D D D D D D 'Z'
 ;

SPACES
 : [ \t\r\n]+ -> skip
 ;

fragment DIGITS
 : D+
 ;

fragment D
 : [0-9]
 ;

The following input:

POLYGON ((35 10, 45 45, 15 40, 10 20, 35 10), (20 30, 35 35, 30 20, 20 30))

will be parsed as follows:

(parse tree image omitted)
