I would like to extract information from a body of text and be able to query it.
The structure of this body of text would be specified by a BNF grammar (or variant), and the information to extract would be specified at runtime (the syntax of the query does not matter at the moment).
So the requirements are simple, really:
- Receive some structured body of text
- Load it in an exploitable form using a grammar to parse it
- Run a query to select some portions of it
To illustrate with an example, suppose that we have such grammar (in a customized BNF format):
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<id> ::= 15 * digit
<hex> ::= 10 * (<digit> | a | b | c | d | e | f)
<anything> ::= <digit> | .... (all characters)
<match> ::= <id> (" " <hex>)*
<nomatch> ::= "." <anything>*
<line> ::= (<match> | <nomatch> | "") [<CR>] <LF>
<text> ::= <line>+
For which such text would be conforming:
012345678901234
012345678901234 abcdef0123
Nor the previous line nor this one would match
And then I would want to list all tags that appear in the rule, so for example using an XPath like syntax:
match//id
which would return a list.
This sounds relatively easy, except that I have two big constraints:
- the BNF grammar should be read at runtime (from a string/vector like structure)
- the queries will be read at runtime too
Some precisions:
- the grammar is not expected to change often so a "compilation" step to produce an in-memory structure is acceptable (and perhaps necessary to achieve good speed)
- speed is of the essence, bonus points for on-the-fly collection of the wanted portions
- bonus points for the possibility to have callbacks to disambiguate (sometimes the necessary disambiguation information might require DB access for example)
- bonus points for multipart grammars (favoring modularity and reuse of grammar elements)
I know of lex/yacc and flex/bison for example, however they appear to only create C / C++ code to be compiled, which is not what I am looking after.
Do you know of a robust library (preferably free and open-source) that can transform a BNF grammar into a parser "on-the-fly" and produce a structured in-memory output from a body of text using this parser ?
EDIT: I am open to alternatives. At the moment, the idea was that perhaps regexes could allow this extraction, however given the complexity of the grammars involved, this could get ugly quickly and thus maintaining the regexes would be quite a horrendous task. Furthermore, by separating grammars and extraction I hope to be able to reuse the same grammar for different extractions needs rather than having slightly different regexes each time.