ANTLR Distinguish DXF group codes and integers

Question

I'm fairly new to ANTLR and I'm trying to write a parser for DXF files with ANTLR 4. DXF files use so-called group codes to specify the type of the data that follows.

Example excerpt from some DXF file:

  0
SECTION
  2
HEADER
  9
$ORTHOMODE
 70
     0
  9
  0
ENDSEC

For example, the first 0 means that a string follows on the next line. The group code 70 means that a 16-bit integer follows; in the example it's a 0. My problem is how I can distinguish between the group code 0 and the integer 0. In the example snippet it seems that integer values have some special indentation, but I couldn't find anything about this in the DXF reference.

My idea so far was the following ANTLR grammar:

grammar SimpleDXF;

start       :   HEADER variable* ENDSEC ;
variable    :   varstart (groupcode NL value NL)+ ;
varstart    :   VAR ;
groupcode   :   INT ;
value       :   INT | ANYCHARSEQ ;

WS          :   [ \t]+ -> skip ;  
NL          :   '\r'? '\n' ;
HEADER      :   '0' NL 'SECTION' NL '2' NL 'HEADER' NL ;
ENDSEC      :   '0' NL 'ENDSEC' NL ;
VAR         :   '9' NL VARNAME NL ;
VARNAME     :   '$' LETTER (LETTER | DIGIT)* NL ;
INT         :   DIGIT+ NL ;
ANYCHARSEQ  :   ANYCHAR+ NL ;

fragment ANYCHAR    :   [\u0021-\u00FF] ;
fragment LETTER     :   [A-Za-z_] ;
fragment DIGIT      :   [0-9] ;

But obviously this fails when trying to parse the integer 0, since the lexer treats it as the group code 0 because of the header rule.

So now I'm clueless how to resolve my problem. Any help is highly appreciated.

EDIT

I changed the ANTLR grammar to include more lexer rules. Now the problem is that the lexer fails completely: the first input character becomes an INT token instead of part of the HEADER token as I intended. The reason is that removing whitespace with -> skip does not work inside a single token (see the following example):

For the input A B (with a space between the two letters), this grammar will work:

start   :   'A' 'B' ;
WS      :   [ \t\r\n]+ -> skip ;

But this grammar will not work:

start   :   AB ;
AB      :   'A' 'B' ;
WS      :   [ \t\r\n]+ -> skip ;

score 1 · Accepted Answer · answered May 26 '14 at 15:58

I've solved the problem with a preprocessing step that puts every group code and its corresponding value on the same line. The preprocessing also removes leading and trailing whitespace, as @UweAllner suggested. The example input file from the question looks like this after preprocessing:

0 SECTION
2 HEADER
9 $ORTHOMODE
70 0
0 ENDSEC

With this layout it is easy to distinguish group codes from plain integers, because group codes are always at the start of a line, while integer values are at the end of a line. The following example grammar solves the problem:

grammar SimpleDXF;

start           :   HEADER variable* ENDSEC ;
variable        :   varstart groupcodevalue+ ;
varstart        :   VAR ;
groupcodevalue  :   GROUPCODE value ;
value           :   (INT | ANYCHARSEQ) NL ;

NL              :   '\r'? '\n' ;
HEADER          :   '0 SECTION' NL '2 HEADER' NL ;
ENDSEC          :   '0 ENDSEC' NL ;
VAR             :   '9 ' VARNAME NL ;
GROUPCODE       :   INT ' ' ;
VARNAME         :   '$' LETTER (LETTER | DIGIT)* ;
INT             :   '-'? DIGIT+ ;
ANYCHARSEQ      :   ANYCHAR+ ;

fragment ANYCHAR:   [\u0021-\u00FF] ;
fragment LETTER :   [A-Za-z_] ;
fragment DIGIT  :   [0-9] ;
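
The preprocessing step described above could be sketched roughly as follows. This is a minimal Python sketch, not the code actually used: the function name join_group_pairs is hypothetical, and it assumes the input strictly alternates group-code lines and value lines. The sample is the excerpt from the question with the unpaired 9 line left out, as in the preprocessed output above.

```python
def join_group_pairs(text):
    """Put each group code and its value on one line, trimmed of whitespace.

    Hypothetical helper; assumes lines strictly alternate: code, value, code, ...
    """
    lines = [line.strip() for line in text.splitlines()]
    # Ignore blank lines at the very end of the input.
    while lines and lines[-1] == "":
        lines.pop()
    if len(lines) % 2 != 0:
        raise ValueError("expected an even number of lines (group code, value)")
    # Even indices hold group codes, odd indices hold the matching values.
    pairs = zip(lines[0::2], lines[1::2])
    return "".join(f"{code} {value}\n" for code, value in pairs)


sample = (
    "  0\nSECTION\n"
    "  2\nHEADER\n"
    "  9\n$ORTHOMODE\n"
    " 70\n     0\n"
    "  0\nENDSEC\n"
)
print(join_group_pairs(sample), end="")
```

Raising an error on an odd number of lines is a design choice of this sketch: a dangling group code would otherwise be dropped silently and shift the meaning of the input.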

Uwe Allner · Answer 2 · 2014-05-26T07:25:50.243

You are missing a rule like

group: groupcode NL value;

Otherwise (as you say) no distinction is possible between groupcodes and values as such. Or, if one groupcode may be followed by several values:

group: groupcode (NL value)+;

And you should define header and endsec as HEADER and ENDSEC to allow the lexer to distinguish between "just a number" and "is the start of a sequence". The same possibly for the start of the variable rule (and everything consisting of a fixed sentence).

EDIT: Something like

HEADER      :   '0' WS* NL WS* 'SECTION' WS* NL WS* '2' WS* NL WS* 'HEADER' WS* NL ;

comes spontaneously to my mind, while not being very elegant. But strange file formats require exotic measures.

To straighten this out a little, would it be possible for you to trim the lines of leading and trailing whitespace before they are lexed and parsed?

edited May 26 '14 at 07:25

answered May 22 '14 at 11:52


I implicitly have this in the `variable` rule as a subrule: `(groupcode NL value NL)+` I also tried exchanging this subrule with your suggestion, but as expected I still get the same result... – schauk11erd May 22 '14 at 12:10
The example you gave is indeed not parseable by this rule; with groupcode 70 and 0 as value consumed there stays a 0 between this and the presumed endsec consisting of 0 NL ENDSEC. Is there more than one value possible per groupcode? – Uwe Allner May 22 '14 at 12:14
There is only one value possible per group code, but a variable in the header section may have multiple parameters (group code + value). IMO the problem is that the value `0` is in the wrong token class because of the `header` rule, where the `'0' ...` causes the lexer to make a token for zeros. – schauk11erd May 22 '14 at 12:19
@Ibizarudi I gave it another try; see answer – Uwe Allner May 26 '14 at 07:26
Thanks for your input. I was able to come up with a solution with preprocessing the dxf file thanks to your latest edit. – schauk11erd May 26 '14 at 16:00
