ANTLR: parsing header followed by binary data chunk with unknown length

Question

A data stream contains two packets. Each packet has a header followed by binary data of unknown length, which runs until another header is found or EOF is reached. Here is the data: HDR12HDR345, where HDR is the header marker and 12 and 345 are the binary data.

Here is my current (incorrect) grammar:

grammar TEST;

parse   :   STREAM EOF;
STREAM  :   PACKET*;
PACKET  :   HEADER DATA;
HEADER  :   'HDR';
DATA    :   .*;

The first header token is recognized, but the data token is too long and it consumes the next header and data.

After three days of searching I have not found a solution that covers both the "binary data" and the "unknown length" aspects. Still, I think this must be a common parsing scenario. ANTLR is not as easy as it looks at first sight :(

Thanks for any help or suggestions.

Bart Kiers · Answer 1 · 2011-12-19T07:45:47.833

Without anything directly placed after .*, ANTLR will consume as much as possible (until the EOF). So the rule:

DATA : .*;

should be changed (there has to be something after .*).

Also, every lexer rule should match at least one character. Your STREAM rule can match the empty string, which can cause the lexer to create an infinite number of empty tokens.

Lastly, ANTLR is meant to parse textual input, not binary data. See this Q&A on the ANTLR mailing list for more info, or do a search on the list.
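If the payload really is arbitrary binary, a small hand-written scanner over the raw bytes may be simpler than a grammar. The following is only a sketch under stated assumptions: the class name PacketSplitter and its methods are made up for illustration, the marker is assumed to be the three ASCII bytes of "HDR", and the payload is assumed never to contain that byte sequence (otherwise the format itself is ambiguous).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of ANTLR: splits a byte stream on a header marker.
class PacketSplitter {
    // Assumption: the header marker is the ASCII bytes 'H', 'D', 'R'.
    static final byte[] MARKER = "HDR".getBytes(StandardCharsets.US_ASCII);

    // True if the full marker starts at index pos.
    static boolean markerAt(byte[] buf, int pos) {
        if (pos + MARKER.length > buf.length) {
            return false;
        }
        for (int k = 0; k < MARKER.length; k++) {
            if (buf[pos + k] != MARKER[k]) {
                return false;
            }
        }
        return true;
    }

    // Returns the data part of every packet, in order of appearance.
    static List<byte[]> payloads(byte[] buf) {
        List<byte[]> out = new ArrayList<>();
        int pos = 0;
        while (pos < buf.length) {
            if (!markerAt(buf, pos)) {
                throw new IllegalArgumentException("expected header marker at index " + pos);
            }
            int start = pos + MARKER.length;
            int end = start;
            // Consume data until the next marker or the end of the stream.
            while (end < buf.length && !markerAt(buf, end)) {
                end++;
            }
            out.add(Arrays.copyOfRange(buf, start, end));
            pos = end;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] stream = "HDR12HDR345".getBytes(StandardCharsets.US_ASCII);
        for (byte[] p : payloads(stream)) {
            System.out.println(new String(p, StandardCharsets.US_ASCII));
        }
    }
}
```

For the data from the question, HDR12HDR345, this prints 12 and then 345. A packet with no data yields an empty byte array rather than being skipped.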

EDIT

Besides placing something after the .*, you can also perform a bit of "manual" look-ahead in the lexer. Here is a small demo of how you can tell ANTLR to keep consuming characters until the lexer "sees" something ahead ("HDR", in your case):

grammar T;

@parser::members {
  public static void main(String[] args) throws Exception {
    String input = "HDR1 foo HDR2 bar \n\n baz HDR3HDR4 the end...";
    TLexer lexer = new TLexer(new ANTLRStringStream(input));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    parser.parse();
  }
}

@lexer::members {
  private boolean hdrAhead() {
    return input.LA(1) == 'H' && 
           input.LA(2) == 'D' && 
           input.LA(3) == 'R';
  }
}

parse  : stream EOF;
stream : packet*; // parser rules _can_ match nothing
packet : HEADER DATA? {System.out.println("parsed :: " + $text.replaceAll("\\s+", ""));};
HEADER : 'HDR' '0'..'9'+;
DATA   : ({!hdrAhead()}?=> .)+;

If you run the demo above:

java -cp antlr-3.3.jar org.antlr.Tool T.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar TParser

(on Windows, the last command is: java -cp .;antlr-3.3.jar TParser)

the following is printed to the console:

parsed :: HDR1foo
parsed :: HDR2barbaz
parsed :: HDR3
parsed :: HDR4theend...

for the input string:

HDR1 foo HDR2 bar 

baz HDR3HDR4 the end...
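To see what the look-ahead predicate is doing without generating a lexer, the same logic can be imitated in plain Java. This is a rough sketch and not what ANTLR generates: the class LookaheadDemo and the method packets are invented for illustration, and only hdrAhead corresponds to a name in the grammar above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the generated lexer and parser; it only mimics their effect.
class LookaheadDemo {

    // Counterpart of hdrAhead() in the grammar: true if "HDR" starts at pos.
    static boolean hdrAhead(String s, int pos) {
        return s.startsWith("HDR", pos);
    }

    // Returns each packet's text (HEADER plus optional DATA) with whitespace removed.
    static List<String> packets(String input) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (!hdrAhead(input, i)) {
                throw new IllegalArgumentException("expected HDR at index " + i);
            }
            int start = i;
            i += 3;
            // HEADER : 'HDR' '0'..'9'+
            int digitsStart = i;
            while (i < input.length() && Character.isDigit(input.charAt(i))) {
                i++;
            }
            if (i == digitsStart) {
                throw new IllegalArgumentException("expected a digit at index " + i);
            }
            // DATA : ({!hdrAhead()}?=> .)+  -> consume while no "HDR" is ahead
            while (i < input.length() && !hdrAhead(input, i)) {
                i++;
            }
            result.add(input.substring(start, i).replaceAll("\\s+", ""));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "HDR1 foo HDR2 bar \n\n baz HDR3HDR4 the end...";
        for (String p : packets(input)) {
            System.out.println("parsed :: " + p);
        }
    }
}
```

Run on the demo input, it prints the same four "parsed ::" lines as the grammar. Note that, as in the grammar, digits directly after HDR count as part of the header, so HDR12HDR345 yields the headers HDR12 and HDR345 with no data.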

Thanks. Is it the "something" something like: "Dear ANTLR look forward for HEADER token occurrence. If you find it, start parsing a NEW token from there?" — vita, Dec 18 '11 at 21:27
More or less. The `.*` needs to stop before the next `'HDR'`. — Bart Kiers, Dec 18 '11 at 21:55
