
I'm using ANTLR4 in Java, and I can parse a WKT polygon string like this:

polygon((20 30, 30 40, 50 60, 20 30)) 

with this lexer rule:

POLYGON: ('polygon'|'POLYGON')'(('[0-9:,-.eTZ" ]+'))';

The values inside polygon(( )) can be datetimes or floats, which is why the rule allows a few extra characters besides digits.

However, I couldn't parse a polygon with inner polygons like this:

polygon((20 30, 30 40, 50 60, 20 30), (20 30, 30 40, 50 60, 20 30), (20 30, 30 40, 50 60, 20 30))

When I tried to add ( and ) to the lexer rule, e.g.:

POLYGON: ('polygon'|'POLYGON')'(('[0-9:,-.eTZ" \(\)]+'))';

Java throws an exception saying it cannot find ")".

What can I do to make ANTLR4 parse polygon((), (), (), ...)?

Bằng Rikimaru

2 Answers


I think that you shouldn't do it with just a lexer. You should use your lexer to split the input into symbols, e.g. 'polygon', '(', ')', ',', <number>, <date> and so on. Then implement a grammar to deal with the large-scale syntax, e.g.

<polygon> ::= 'polygon' '(' <list> ')'

<list> ::= '(' ')' |
           '(' <element> ( ',' <element> ) * ')'

<element> ::= <number> | <date>

(The meta-syntax I'm using is sort of EBNF ....)
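To make the split between lexer and grammar concrete, here is a minimal hand-written sketch in plain Java (no ANTLR involved). The class name WktSketch and all of its methods are hypothetical. It also makes two assumptions that go beyond the grammar above: a polygon may hold several comma-separated lists, and an element may consist of several whitespace-separated values such as "20 30".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of ANTLR or of the grammar above.
class WktSketch {
    // Lexer step: a token is '(', ')', ',' or a run of other non-space characters.
    private static final Pattern TOKEN = Pattern.compile("[(),]|[^\\s(),]+");

    private final List<String> tokens;
    private int pos = 0;

    private WktSketch(List<String> tokens) { this.tokens = tokens; }

    static List<String> tokenize(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) result.add(m.group());
        return result;
    }

    // Entry point: returns one list of elements per parenthesised list.
    static List<List<String>> parse(String input) {
        return new WktSketch(tokenize(input)).polygon();
    }

    // <polygon> ::= 'polygon' '(' <list> ( ',' <list> )* ')'   (assumed extension)
    private List<List<String>> polygon() {
        String keyword = next();
        if (!keyword.equalsIgnoreCase("polygon"))
            throw new IllegalArgumentException("expected 'polygon' but found '" + keyword + "'");
        expect("(");
        List<List<String>> lists = new ArrayList<>();
        lists.add(list());
        while (peek().equals(",")) { next(); lists.add(list()); }
        expect(")");
        if (pos != tokens.size())
            throw new IllegalArgumentException("unexpected input after polygon");
        return lists;
    }

    // <list> ::= '(' ')' | '(' <element> ( ',' <element> )* ')'
    private List<String> list() {
        expect("(");
        List<String> elements = new ArrayList<>();
        if (peek().equals(")")) { next(); return elements; }
        elements.add(element());
        while (peek().equals(",")) { next(); elements.add(element()); }
        expect(")");
        return elements;
    }

    // Assumption: an element is one or more values, joined here by a single space.
    private String element() {
        StringBuilder sb = new StringBuilder(value());
        while (!isPunctuation(peek())) sb.append(' ').append(value());
        return sb.toString();
    }

    private String value() {
        String t = next();
        if (isPunctuation(t))
            throw new IllegalArgumentException("expected a value but found '" + t + "'");
        return t;
    }

    // Returns "" once the input is exhausted.
    private String peek() { return pos < tokens.size() ? tokens.get(pos) : ""; }

    private String next() {
        if (pos >= tokens.size()) throw new IllegalArgumentException("unexpected end of input");
        return tokens.get(pos++);
    }

    private void expect(String symbol) {
        String t = next();
        if (!t.equals(symbol))
            throw new IllegalArgumentException("expected '" + symbol + "' but found '" + t + "'");
    }

    private static boolean isPunctuation(String t) {
        return t.isEmpty() || t.equals("(") || t.equals(")") || t.equals(",");
    }
}
```

Under these assumptions, WktSketch.parse applied to the three-list polygon from the question returns three lists of four elements each, and an unbalanced input raises an IllegalArgumentException. A parser generator such as ANTLR produces the equivalent of these methods from the grammar, so this is only meant to show where the nesting is handled.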

The problems with using just a regex-based lexer with no grammar are:

Stephen C

The lexer should only define the basic building blocks of the language. Polygon, list, etc. should be defined as parser rules.

Something like this should get you started:

grammar WKT;

parse
 : polygon EOF
 ;

polygon
 : POLYGON '(' ( points ( ',' points )* )? ')'
 ;

points
 : '(' ( value value ( ',' value value )* )? ')'
 ;

value
 : INT
 | FLOAT
 | DATE_TIME
 ;

POLYGON
 : [pP] [oO] [lL] [yY] [gG] [oO] [nN]
 ;

INT
 : DIGITS
 ;

FLOAT
 : DIGITS '.' DIGITS
 ;

DATE_TIME
 : D D D D '-' D D '-' D D 'T' D D ':' D D ':' D D [+-] D D ':' D D
 | D D D D '-' D D '-' D D 'T' D D ':' D D ':' D D 'Z'
 | D D D D D D D D 'T' D D D D D D 'Z'
 ;

SPACES
 : [ \t\r\n]+ -> skip
 ;

fragment DIGITS
 : D+
 ;

fragment D
 : [0-9]
 ;
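As a rough way to check which inputs the value tokens accept, the sketch below mirrors the INT, FLOAT and DATE_TIME rules with java.util.regex. The class WktValueSketch and its classify method are hypothetical and independent of the code ANTLR generates; they only approximate the lexer rules for a single, already isolated value.

```java
import java.util.regex.Pattern;

// Hypothetical helper that mirrors the INT, FLOAT and DATE_TIME lexer rules.
class WktValueSketch {
    enum Kind { INT, FLOAT, DATE_TIME, UNKNOWN }

    // INT : DIGITS
    private static final Pattern INT = Pattern.compile("[0-9]+");
    // FLOAT : DIGITS '.' DIGITS
    private static final Pattern FLOAT = Pattern.compile("[0-9]+\\.[0-9]+");
    // DATE_TIME : the three alternatives of the rule, combined into one pattern
    private static final Pattern DATE_TIME = Pattern.compile(
        "[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}([+-][0-9]{2}:[0-9]{2}|Z)"
        + "|[0-9]{8}T[0-9]{6}Z");

    // Classifies one value; matches() requires the whole string to fit the rule.
    static Kind classify(String text) {
        if (INT.matcher(text).matches()) return Kind.INT;
        if (FLOAT.matcher(text).matches()) return Kind.FLOAT;
        if (DATE_TIME.matcher(text).matches()) return Kind.DATE_TIME;
        return Kind.UNKNOWN;
    }
}
```

With these patterns, "35" is an INT, "35.5" a FLOAT and "2020-01-02T03:04:05Z" a DATE_TIME, while "-3" and "1e5" come back as UNKNOWN. If negative numbers or exponents are needed, as the character class in the question suggests, the INT and FLOAT rules would have to be extended accordingly.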

The following input: POLYGON ((35 10, 45 45, 15 40, 10 20, 35 10), (20 30, 35 35, 30 20, 20 30)) will be parsed as follows:

[image: parse tree for the input above]

Jiri Tousek
Bart Kiers