I am trying to get a solid strategy together to parse binary data that has embedded integrity symbols. Here are the construction rules in EBNF form:

Log ::= {Data};
Data ::= Key,DataList;
DataList ::= {Structure};

The issue is that Key can also appear inside the DataList, because it is not escaped. I can't think of anything better than a brute-force method where the algorithm is:

- Index all Key locations
- For each key, start trying to parse a Structure
- If the Structure parse fails, try the next key location // possible to lose good data
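
The steps above can be sketched roughly as follows. This is only an illustration under invented assumptions: the 2-byte `KEY` value, the Structure layout (1-byte length, payload, 1-byte additive checksum), and the rule that a DataList ends where the next Key begins are all hypothetical stand-ins, since the real format is not given here.

```python
KEY = b"\xAA\x55"  # hypothetical key bytes; the real Key is not specified


def find_keys(buf, key=KEY):
    """Return every offset where key occurs, including false positives."""
    offsets = []
    pos = buf.find(key)
    while pos != -1:
        offsets.append(pos)
        pos = buf.find(key, pos + 1)
    return offsets


def parse_structure(buf, pos):
    """Parse one hypothetical Structure: length byte, payload, checksum byte.

    Returns (payload, next_pos), or None if it is truncated or fails the check.
    """
    if pos >= len(buf):
        return None
    length = buf[pos]
    end = pos + 1 + length
    if end + 1 > len(buf):
        return None
    payload = buf[pos + 1:end]
    if sum(payload) & 0xFF != buf[end]:
        return None
    return payload, end + 1


def recover(buf):
    """Try each key location in order and keep the Structures that parse."""
    records = []
    resume = 0  # end of the last accepted data
    for k in find_keys(buf):
        if k < resume:
            continue  # this "key" sits inside data that already parsed
        pos = k + len(KEY)
        structures = []
        # Assumption: a DataList ends where the next Key begins.
        while pos < len(buf) and not buf.startswith(KEY, pos):
            parsed = parse_structure(buf, pos)
            if parsed is None:
                break  # give up on this key and move on to the next one
            payload, pos = parsed
            structures.append(payload)
        if structures:
            records.append(structures)
            resume = pos
    return records
```

Skipping key offsets that fall inside already accepted data is what stops an unescaped Key in a payload from being treated as a new record. Data after a failed Structure is still lost up to the next key location.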

Does anyone know of a good strategy for doing something like this? I'm trying to keep data loss to a minimum if there are corrupted records.

Any insight welcome!

  • How do you know where the next structure starts if a parse fails - are they all the same size? If so, you could just parse all of them, initially assuming they are correct, and then discard the ones that don't match the integrity check afterwards. – 500 - Internal Server Error Jun 07 '13 at 15:07
  • No, they're not all the same size. The structures DO have embedded length fields, but like you say, it doesn't help if there's a failure. So you're suggesting a 'parallel' method of parsing starting at the key indices? – Scott Pavetti Jun 07 '13 at 15:10
  • If there's corruption and no sure way to detect when the next structure starts, it seems that you would be stuck. – 500 - Internal Server Error Jun 07 '13 at 17:20

1 Answer

What I ended up doing was putting headers on the data, and not just keys. The header has a sync block, a CRC and a length. This makes it fairly fault tolerant: any corruption is limited to the message that the corruption is in. The parsing strategy is to locate all the sync blocks, decode the header that follows each one and try to parse out the data. A failure indicates either a false positive on the sync block or a record that is actually corrupted.
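
A minimal sketch of that scanning strategy, with an invented layout: the 4-byte `SYNC` value, the header field order (16-bit big-endian length, then CRC-32 of the payload), and the function names are assumptions for illustration, not the actual format I used.

```python
import struct
import zlib

SYNC = b"\xDE\xAD\xBE\xEF"     # hypothetical sync block
HEADER = struct.Struct(">HI")  # assumed header: uint16 length, uint32 CRC-32


def frame(payload):
    """Wrap a payload as sync block + header + data."""
    return SYNC + HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def scan(buf):
    """Return every payload whose header and CRC check out."""
    messages = []
    pos = buf.find(SYNC)
    while pos != -1:
        start = pos + len(SYNC)
        body = start + HEADER.size
        if body <= len(buf):
            length, crc = HEADER.unpack_from(buf, start)
            payload = buf[body:body + length]
            if len(payload) == length and zlib.crc32(payload) == crc:
                messages.append(payload)
                # Resume after this message so sync bytes inside the
                # payload are never examined.
                pos = buf.find(SYNC, body + length)
                continue
        # False positive or corrupted record: retry from the next byte.
        pos = buf.find(SYNC, pos + 1)
    return messages
```

Because the scan restarts one byte after a failed sync block, a corrupted length field cannot make it skip over good messages that follow.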