Without anything placed directly after .*, ANTLR will consume as much as possible (until the EOF). So the rule:
DATA : .*;
should be changed (there has to be something after .*).
Also, every lexer rule should match at least one character. But your STREAM rule could potentially match an empty string, causing your lexer to create an infinite number of empty tokens.
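To see why an empty match is a problem, here is a rough plain-Java analogy. It is not ANTLR's generated code: the class EmptyTokenSketch, the method countTokens and the regex patterns are made up for illustration, and a cap on the token count is added so the demo terminates. A rule that can match nothing never advances the input position, so the loop keeps emitting empty tokens.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyTokenSketch {

    // Counts the tokens produced by a naive "match at the current position" loop.
    // Gives up after maxTokens so the demo terminates (a real lexer has no such cap).
    static int countTokens(String input, Pattern rule, int maxTokens) {
        Matcher m = rule.matcher(input);
        int pos = 0;
        int tokens = 0;
        while (pos < input.length() && tokens < maxTokens) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                break;        // the rule does not match here at all
            }
            tokens++;
            pos = m.end();    // an empty match leaves pos unchanged
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "[a-z]*" can match the empty string, so on "123" the loop never advances
        // and only stops because of the cap.
        System.out.println(countTokens("123", Pattern.compile("[a-z]*"), 1000));
        // "[a-z]+" must consume at least one character, so the loop stops at once.
        System.out.println(countTokens("123", Pattern.compile("[a-z]+"), 1000));
    }
}
```

The first call hits the cap of 1000 tokens, the second produces none.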
Lastly, ANTLR is meant to parse textual input, not binary data. See this Q&A on the ANTLR mailing list for more info, or do a search on the list.
EDIT
Besides placing something after the .*, you can also perform a bit of "manual" look ahead in the lexer. A small demo of how you can tell ANTLR to keep consuming characters until the lexer "sees" something ahead ("HDR", in your case):
grammar T;

@parser::members {
  public static void main(String[] args) throws Exception {
    String input = "HDR1 foo HDR2 bar \n\n baz HDR3HDR4 the end...";
    TLexer lexer = new TLexer(new ANTLRStringStream(input));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    parser.parse();
  }
}

@lexer::members {
  private boolean hdrAhead() {
    return input.LA(1) == 'H' &&
           input.LA(2) == 'D' &&
           input.LA(3) == 'R';
  }
}

parse  : stream EOF;
stream : packet*; // parser rules _can_ match nothing
packet : HEADER DATA? {System.out.println("parsed :: " + $text.replaceAll("\\s+", ""));};

HEADER : 'HDR' '0'..'9'+;
DATA   : ({!hdrAhead()}?=> .)+;
If you run the demo above:
java -cp antlr-3.3.jar org.antlr.Tool T.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar TParser
(on Windows, the last command is: java -cp .;antlr-3.3.jar TParser)
the following is printed to the console:
parsed :: HDR1foo
parsed :: HDR2barbaz
parsed :: HDR3
parsed :: HDR4theend...
for the input string:
HDR1 foo HDR2 bar
baz HDR3HDR4 the end...
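If you want to see what the hdrAhead() predicate is doing without generating a lexer, the same idea can be sketched in plain Java. This is only an approximation of the grammar above, not what ANTLR generates: the class HdrLookaheadSketch and the method packets are hypothetical names, and unlike the real lexer and parser it does not report an error for input that does not start with a header or for an "HDR" that is not followed by digits.

```java
import java.util.ArrayList;
import java.util.List;

public class HdrLookaheadSketch {

    // Mirrors hdrAhead() from the grammar: true if "HDR" starts at position pos.
    static boolean hdrAhead(String s, int pos) {
        return s.startsWith("HDR", pos);
    }

    // Splits the input into packets: a header ("HDR" plus digits) followed by
    // optional data that runs until the next "HDR" or the end of the input.
    // Whitespace is removed from each packet, as in the demo's print statement.
    static List<String> packets(String s) {
        List<String> result = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            int start = pos;
            if (hdrAhead(s, pos)) {
                pos += 3; // skip "HDR"
                while (pos < s.length() && s.charAt(pos) >= '0' && s.charAt(pos) <= '9') {
                    pos++; // header digits
                }
            }
            // DATA: consume one character at a time while no "HDR" is ahead
            while (pos < s.length() && !hdrAhead(s, pos)) {
                pos++;
            }
            result.add(s.substring(start, pos).replaceAll("\\s+", ""));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "HDR1 foo HDR2 bar \n\n baz HDR3HDR4 the end...";
        for (String packet : packets(input)) {
            System.out.println("parsed :: " + packet);
        }
    }
}
```

For the demo's input string this prints the same four lines as the generated parser.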