I am trying to create a grammar for a format that follows a type-length-value (TLV) convention. Can ANTLR4 read in a length value and then parse that many characters?

poke

1 Answer

NO ...

From your question (which is very short, so I could miss something ...) I gather you are mixing up grammars and encoding rules.

When you say type-length-value, it sounds like an encoding rule to me (how to serialize data). In my experience, you write this code yourself.
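For example, decoding a TLV stream by hand is usually only a few lines of ordinary code. The sketch below assumes a made-up layout (one type byte, a 2-byte big-endian length, then that many value bytes); your actual format will differ, but the shape of the code is the same:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hand-rolled reader for an assumed TLV layout:
// 1 type byte, 2-byte big-endian length, then 'length' value bytes.
class TlvRecord {
    final int type;
    final byte[] value;

    TlvRecord(int type, byte[] value) {
        this.type = type;
        this.value = value;
    }

    // Returns null at end of stream.
    static TlvRecord read(InputStream in) throws IOException {
        int type = in.read();
        if (type < 0) {
            return null;                        // no more records
        }
        DataInputStream data = new DataInputStream(in);
        int length = data.readUnsignedShort();  // the "length" field
        byte[] value = new byte[length];
        data.readFully(value);                  // read exactly 'length' bytes
        return new TlvRecord(type, value);
    }
}
```

The point is that the "read a length, then read that many bytes" step is plain imperative code, not something a context-free grammar expresses naturally.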

A grammar works at a higher level: it describes the structure of a piece of text. ANTLR will help you break that text into tokens and then into a tree that you can navigate. This step only handles text: even if you went that way to solve your problem, you would still have to handle the type, length and value yourself.

EDIT: With a bit of googling I found this: https://github.com/NickstaDB/SerializationDumper

YaFred
  • Thanks for the insightful comment, I think this answers my question already. I'm hoping you can provide some follow-up information. Specifically, I am trying to write a parser for Java serialization. Oracle has provided a BNF-like grammar here: https://docs.oracle.com/javase/8/docs/platform/serialization/spec/protocol.html. Some parts of this grammar specify how long a string is, followed by a non-null-terminated string. Is ANTLR not a realistic approach for parsing this format? Is this not a context-free grammar? – poke Jul 05 '18 at 00:10
  • I understand. My own experience was https://github.com/yafred/asn1-tool, where the distinction between grammar and encoding rules is very clear. I'm not sure the word "grammar" in Oracle's document is quite proper. Let me have a look ... – YaFred Jul 05 '18 at 07:37
  • I stick to my first answer. This is not text and it can't be handled by a grammar parser. See my edited answer for a project similar to yours. – YaFred Jul 05 '18 at 09:20
  • I don't fully agree with @YaFred: essentially the tricky part is only the lexing. Whereas for text lexing/parsing one can typically tokenize the input stream by splitting the text at whitespace separators, this is not easily possible for binary TLV data. But let's say one can write a lexer that splits the TLV input stream into `Tag`, `Length` and `Value` tokens; then a parser would be able to parse this token sequence (see the sketch below). (But then again, I am not a parser expert.) – Stefan D. Nov 18 '19 at 16:02
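To illustrate the suggestion in the last comment: a hand-written tokenizer could turn a binary TLV stream into ANTLR tokens and hand them to a generated parser via `ListTokenSource` and `CommonTokenStream`. This is only a sketch under assumptions: the TLV layout (1 tag byte, 2-byte big-endian length) and the token type constants are made up, and the grammar plus its generated parser are not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.antlr.v4.runtime.CommonToken;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.ListTokenSource;
import org.antlr.v4.runtime.Token;

// Hand-written "lexer" that splits a binary TLV stream into
// TAG / LENGTH / VALUE tokens which an ANTLR-generated parser
// could then consume. Layout and token type numbers are assumptions.
public class TlvTokenizer {
    static final int TAG = 1, LENGTH = 2, VALUE = 3;

    static CommonTokenStream tokenize(byte[] input) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(input));
        List<Token> tokens = new ArrayList<>();
        while (in.available() > 0) {
            int tag = in.readUnsignedByte();      // "type"
            int length = in.readUnsignedShort();  // "length"
            byte[] value = new byte[length];
            in.readFully(value);                  // "value": exactly 'length' bytes
            tokens.add(new CommonToken(TAG, Integer.toString(tag)));
            tokens.add(new CommonToken(LENGTH, Integer.toString(length)));
            tokens.add(new CommonToken(VALUE, new String(value, StandardCharsets.ISO_8859_1)));
        }
        return new CommonTokenStream(new ListTokenSource(tokens));
    }
}
```

Note that the length-vs-value consistency is already enforced in the hand-written part; a grammar on top would only describe how records may follow each other, which matches the answer's point that the type/length/value handling ends up being your own code either way.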