
I would like to extract information from a body of text and be able to query it.

The structure of this body of text would be specified by a BNF grammar (or variant), and the information to extract would be specified at runtime (the syntax of the query does not matter at the moment).

So the requirements are simple, really:

  • Receive some structured body of text
  • Load it in an exploitable form using a grammar to parse it
  • Run a query to select some portions of it

To illustrate with an example, suppose we have the following grammar (in a customized BNF format):

<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<id> ::= 15 * <digit>

<hex> ::= 10 * (<digit> | a | b | c | d | e | f)

<anything> ::= <digit> | .... (all characters)

<match> ::= <id> (" " <hex>)*

<nomatch> ::= "." <anything>*

<line> ::= (<match> | <nomatch> | "") [<CR>] <LF>

<text> ::= <line>+

The following text would conform to it:

012345678901234
012345678901234 abcdef0123

Nor the previous line nor this one would match

I would then want to list all <id> tags that appear in the <match> rule, for example using an XPath-like syntax:

match//id

which would return a list.
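
As a rough illustration of the three requirements (load a grammar as data, parse, query), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the tuple-based grammar encoding, the names `parse`, `parse_rule` and `query`, and the simplification of `<anything>` to printable ASCII. It uses greedy matching with ordered choice and no backtracking, so it is not a general BNF engine, and the query language only understands `//`.

```python
from collections import namedtuple

Tree = namedtuple("Tree", "name text children")

# Printable ASCII stands in for <anything>; line terminators are excluded so
# that the greedy repetition in <nomatch> stops at the end of the line.
PRINTABLE = "".join(map(chr, range(32, 127)))

# The grammar is plain data, so it could be built at runtime from a BNF string.
# Node kinds (hypothetical): lit, set, ref, seq, alt, rep(item, min, max or None).
GRAMMAR = {
    "digit": ("set", "0123456789"),
    "id": ("rep", ("ref", "digit"), 15, 15),
    "hex": ("rep", ("set", "0123456789abcdef"), 10, 10),
    "match": ("seq", [("ref", "id"),
                      ("rep", ("seq", [("lit", " "), ("ref", "hex")]), 0, None)]),
    "nomatch": ("seq", [("lit", "."), ("rep", ("set", PRINTABLE), 0, None)]),
    "line": ("seq", [("alt", [("ref", "match"), ("ref", "nomatch"), ("lit", "")]),
                     ("rep", ("lit", "\r"), 0, 1), ("lit", "\n")]),
    "text": ("rep", ("ref", "line"), 1, None),
}

def parse(grammar, node, text, pos):
    """Return (next_pos, subtrees) on success, or None on failure."""
    kind = node[0]
    if kind == "lit":
        return (pos + len(node[1]), []) if text.startswith(node[1], pos) else None
    if kind == "set":
        return (pos + 1, []) if pos < len(text) and text[pos] in node[1] else None
    if kind == "ref":  # only named rules create nodes in the tree
        result = parse(grammar, grammar[node[1]], text, pos)
        if result is None:
            return None
        end, children = result
        return end, [Tree(node[1], text[pos:end], children)]
    if kind == "seq":
        children = []
        for item in node[1]:
            result = parse(grammar, item, text, pos)
            if result is None:
                return None
            pos, sub = result
            children += sub
        return pos, children
    if kind == "alt":  # ordered choice: the first alternative that matches wins
        for item in node[1]:
            result = parse(grammar, item, text, pos)
            if result is not None:
                return result
        return None
    if kind == "rep":
        _, item, lo, hi = node
        children, count = [], 0
        while hi is None or count < hi:
            result = parse(grammar, item, text, pos)
            if result is None:
                break
            new_pos, sub = result
            children += sub
            count += 1
            if new_pos == pos:  # empty match: stop to avoid looping forever
                break
            pos = new_pos
        return (pos, children) if count >= lo else None
    raise ValueError("unknown node kind: %r" % (kind,))

def parse_rule(grammar, start, text):
    """Parse the whole text with the rule named start and return its tree."""
    result = parse(grammar, ("ref", start), text, 0)
    if result is None or result[0] != len(text):
        raise ValueError("input does not conform to the grammar")
    return result[1][0]

def descendants(tree):
    for child in tree.children:
        yield child
        yield from descendants(child)

def query(root, path):
    """Tiny XPath-like query: 'a//b' selects the text of every b below every a."""
    current = [root]
    for step, name in enumerate(path.split("//")):
        selected = []
        for node in current:
            candidates = ([node] if step == 0 else []) + list(descendants(node))
            selected += [n for n in candidates if n.name == name]
        current = selected
    return [n.text for n in current]

SAMPLE = "012345678901234\n012345678901234 abcdef0123\n\n.a line that does not match\n"
tree = parse_rule(GRAMMAR, "text", SAMPLE)
print(query(tree, "match//id"))
```

Because the grammar is an ordinary dictionary, swapping it at runtime needs no recompilation. A real solution would still need a BNF reader to produce that dictionary, plus a faster matching strategy than this recursive interpreter.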


This sounds relatively easy, except that I have two big constraints:

  • the BNF grammar should be read at runtime (from a string/vector like structure)
  • the queries will be read at runtime too

Some clarifications:

  • the grammar is not expected to change often so a "compilation" step to produce an in-memory structure is acceptable (and perhaps necessary to achieve good speed)
  • speed is of the essence, bonus points for on-the-fly collection of the wanted portions
  • bonus points for the possibility to have callbacks to disambiguate (sometimes the necessary disambiguation information might require DB access for example)
  • bonus points for multipart grammars (favoring modularity and reuse of grammar elements)

I know of lex/yacc and flex/bison, for example; however, they appear to only generate C/C++ code that must then be compiled, which is not what I am looking for.

Do you know of a robust library (preferably free and open-source) that can transform a BNF grammar into a parser "on-the-fly" and produce a structured in-memory output from a body of text using this parser?

EDIT: I am open to alternatives. At the moment, the idea is that regexes could perhaps handle this extraction; however, given the complexity of the grammars involved, this could get ugly quickly, and maintaining the regexes would be a horrendous task. Furthermore, by separating grammars and extraction, I hope to be able to reuse the same grammar for different extraction needs rather than having slightly different regexes each time.
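
To make the regex alternative concrete, here is a hedged Python sketch of what it might look like for the example grammar above. The pattern and the helper names `extract_ids` and `extract_hexes` are illustrative assumptions; the point is only that the structure of a matching line and the choice of what to extract end up fused in a single expression.

```python
import re

# The "grammar" (15 digits, then space-separated 10-character hex groups) and
# the extraction (named capture groups) live in the same expression.
MATCH_LINE = re.compile(
    r"^(?P<id>[0-9]{15})(?P<hexes>(?: [0-9a-f]{10})*)\r?$",
    re.MULTILINE,
)

def extract_ids(text):
    """Return the id of every line that has the shape of <match>."""
    return [m.group("id") for m in MATCH_LINE.finditer(text)]

def extract_hexes(text):
    """Return every hex group that follows an id on a <match> line."""
    return [h for m in MATCH_LINE.finditer(text) for h in m.group("hexes").split()]

sample = "012345678901234\n012345678901234 abcdef0123\n\n.not a match\n"
print(extract_ids(sample))
print(extract_hexes(sample))
```

Each new kind of extraction needs another pattern or another post-processing step, which is the maintenance concern described above.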

Matthieu M.
  • If all else fails, write a parser generator in Boost.Spirit. ;-) Meta meta with template metaprogramming. – Konrad Rudolph Jun 12 '12 at 14:29
  • I think this is a partial dupe: http://stackoverflow.com/questions/7392302/build-parser-from-grammar-at-runtime. It only covers the parser generation, not the other details specific to this question. And it has no answers of the form "yes, and here is a link to it". So I'm not claiming this question is redundant. – Steve Jessop Jun 12 '12 at 14:30
  • It looks like you're trying to build a regex machine (correct me if I am wrong). Try using state machine(s) if you are. – dirkgently Jun 12 '12 at 14:41
  • I have a [compiler library](https://github.com/ShabbyX/shCompiler) which you might be interested in. It reads the grammar on the fly, creates grammar tables and caches it, so reruns are fast. The input format is not BNF, and you have to do the semantic analysis through action routines. I wrote it quick and dirty in C++, so the implementation may not look too beautiful. It accepts ambiguous grammars, but you can only resolve during table creation, not when parsing. In a few months, I will finish rewriting it in C with a better and faster substructure, but its usage would not be too different. – Shahbaz Jun 12 '12 at 14:44
  • @dirkgently No, it's not regular grammars (which can be described by regular expressions), it's context-free grammars. Although, he could use a state machine _with a stack_. – Eric Finn Jun 12 '12 at 15:15
  • @dirkgently: regexes *are* an option, but we will only go down that road if no better option is available. I would prefer decoupling the parse from the extraction, if possible, so that the parse is specified once and less savvy people can specify the extraction by reusing it. – Matthieu M. Jun 12 '12 at 15:23
  • @Matthieu: it strikes me that you could perhaps "compile" your grammar + extraction *to regexes* (where by regexes, I mean what real regex libraries accept, they're in fact more liberal than regular grammars, and abuse the terminology). This could still allow the extraction to be a separate input, in simpler language and hence doable by less savvy people. But the extraction rules could influence what regex(es) you end up with. Then you'd compile the regex(es), of course. There must be subsets of BNF that are easy to interpret as regexes, I'm just not sure exactly what. – Steve Jessop Jun 12 '12 at 15:48
  • @MatthieuM.: Note that I used the term *regex* loosely. My emphasis was on a state machine; my personal experience with state machines vis-a-vis regex engines has been heavily biased towards state machines simply for superior performance and the fact that you can define transitions at runtime and still be able to parse just about any well-formed input. Note that this is a common enough occurrence though: in almost every project there comes a time when you're stuck inventing a DSL (lo' and behold here I go off again :P). – dirkgently Jun 12 '12 at 15:58
  • Do you think embedding a Lua/JavaScript interpreter and letting the user play around with it (in a commonly understood language) would be any better? It depends on your user class though and their skillset. (Chances are this will be an overkill!) – dirkgently Jun 12 '12 at 15:59
  • @dirkgently: heck, replace the BNF with some rules to construct a DOM tree from the input data. And replace the "query to select some portion of it" with jQuery (or XPath). Everyone loves jQuery (or XPath)! And you can blame any performance issues on your javascript engine (or saxon). – Steve Jessop Jun 12 '12 at 16:15
  • @dirkgently: no, embedding any interpreter or JIT compiler is not an option (I had thought about it :p). – Matthieu M. Jun 12 '12 at 17:59
  • @SteveJessop: I understand what you mean. I had thought about the matcher carrying the XPath query to feed it directly rather than building a full-fledged AST in memory. I am a bit afraid that a translation to regexes would turn into a maintenance nightmare, even with a good test suite: regexes are mostly write-only, so decoding computer-generated ones for debugging purposes... it's subjective but I am afraid of what it might incur! – Matthieu M. Jun 12 '12 at 18:03
  • @MatthieuM: it's true, but the way I try to think of it is that LLVM bytecode isn't human-readable either, and we don't mind using that as a transitional phase in compilation. Of course, if I was the one writing the C++ to LLVM part of the toolchain, I'd rapidly get a lot better at reading the bytecode. The trouble with regexes, mind you, is that they aren't human-readable *and* it's hard to get a good disassembler for them ;-) – Steve Jessop Jun 12 '12 at 18:05

2 Answers


I have a proprietary solution that can convert grammar source into an in-memory representation. The result is a pure data structure that any code can use. I also have a C++ class that actually implements the parser. Rule handlers are implemented as virtual methods.

The primary difference between our solution and YACC/Bison is that no C/C++ code is generated. This means that the grammar can be reloaded without recompiling the app. The grammar can be annotated with application IDs that are used in the code of the rule handlers.
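
The general pattern described here (grammar as reloadable data, rules annotated with application IDs, handlers as overridable methods) can be sketched as follows. This is not the API of the proprietary library, which is not shown in this thread. The class names `Engine` and `Collector`, the rule table format, and the ID constants are all hypothetical, and the toy matcher only supports sequences and ordered alternatives.

```python
ID_KEY, ID_VALUE, ID_PAIR = 1, 2, 3  # hypothetical application IDs

# Pure data: rule name -> (application ID or None, list of alternatives).
# Each alternative is a list of symbols: a rule name or a literal string.
RULES = {
    "pair": (ID_PAIR, [["key", "=", "value"]]),
    "key": (ID_KEY, [["a"], ["b"]]),
    "value": (ID_VALUE, [["0"], ["1"]]),
}

class Engine:
    """Generic engine; it knows nothing about any particular grammar."""

    def __init__(self, rules):
        self.rules = rules  # can be replaced at runtime without recompiling

    def on_rule(self, app_id, text):
        """Rule handler hook, overridden by subclasses (the 'virtual method')."""

    def _match(self, name, text, pos):
        app_id, alternatives = self.rules[name]
        for symbols in alternatives:
            end, events = pos, []
            for sym in symbols:
                if sym in self.rules:
                    sub = self._match(sym, text, end)
                    if sub is None:
                        break
                    end, sub_events = sub
                    events += sub_events
                elif text.startswith(sym, end):
                    end += len(sym)
                else:
                    break
            else:  # every symbol of this alternative matched
                if app_id is not None:
                    events.append((app_id, text[pos:end]))
                return end, events
        return None

    def run(self, start, text):
        """Parse text; fire handlers only if the whole input is accepted."""
        result = self._match(start, text, 0)
        if result is None or result[0] != len(text):
            return False
        for app_id, matched in result[1]:
            self.on_rule(app_id, matched)
        return True

class Collector(Engine):
    """Application code: reacts to annotated rules by their application ID."""

    def __init__(self, rules):
        super().__init__(rules)
        self.seen = []

    def on_rule(self, app_id, text):
        self.seen.append((app_id, text))

collector = Collector(RULES)
print(collector.run("pair", "b=1"), collector.seen)
```

Handlers are deferred until the whole input is accepted so that a sub-rule matched inside a failed alternative never triggers application code; an engine aiming for on-the-fly collection would have to make a different trade-off.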

Rais Alam
Kirill Kobelev
  • Thanks for the answer, I'll need some time to check it out exactly, in the meantime +1 for providing a promising answer already. – Matthieu M. Jun 14 '12 at 07:50
  • I looked at the site but could not find the corresponding code. Is it possible to view some documentation/API of this parser? As it is, it is hard for me to form an opinion on this... – Matthieu M. Jun 15 '12 at 08:44
  • Matthieu, give me your email address. I will send you some info. The source language there is similar to YACC, but not exactly the same. An example of the source grammar is at http://www.cdsan.com/GraCpp_Grammar.php. The source grammar can be taken both from a disk file and from a memory buffer. – Kirill Kobelev Jun 15 '12 at 13:57
  • Thanks, I had not seen the grammars :) I have sent an e-mail via the website contact form. – Matthieu M. Jun 15 '12 at 14:08
  • Matthieu, please be reasonable. I was not speaking about grammars in general, which you have seen many times, but about an example in a particular grammar definition language that has sections (, , , etc) and features that describe conflicts. I am not sure there is any other grammar definition language that has a concept of expected grammar conflicts. – Kirill Kobelev Jun 15 '12 at 16:45
  • I did not understand this last comment, sorry. Which conflicts are you talking about? – Matthieu M. Jun 15 '12 at 17:40
  • It is hard to explain the concept of a grammar conflict here. You can find a brief intro at http://en.wikipedia.org/wiki/LL_parser. Simple grammars, like the one needed for parsing expressions such as "(5+3)*7", do not have conflicts. More complex grammars, like a C++ grammar, have them. Bison is well suited only for grammars without conflicts. My grammar definition parser is much more advanced in this area. – Kirill Kobelev Jun 15 '12 at 20:08
  • Ah, thanks for the heads up. I am unsure yet as to whether the grammars I will work with will have conflicts or not (the first few I have seen did not). – Matthieu M. Jun 15 '12 at 20:32

The GOLD parser system produces an LALR parse table that, AFAIK, is loaded at runtime. I believe it has a C++ "parsing" engine, so that should be easy to integrate.

You'd read your grammar, fork a subprocess to get the GOLD parser generator to produce the table, and then call your wired-in GOLD parser to load-and-parse.

I don't know how you attach actions to the reductions, which you'd probably like to do. I have no specific experience with GOLD. "Gold" luck to you.

Ira Baxter
  • I would like to be able to attach "validation" actions to some items, if possible, but this could be dealt with afterward otherwise. As long as I can efficiently extract information, I'll have external code to deal with the information. – Matthieu M. Jun 14 '12 at 07:49