
I see umpteen posts a day about "how to do X with regexen". And the best response to most of them seems like it would honestly be, "Why are you trying to drive a screw with a hammer?" But regexen are everywhere, and the syntax is mostly portable, particularly if you keep away from the fancy bits.

Is there anything equivalent to regexen but at the next level up in power and configurability? A "you can use it anywhere" parsing library of some variety, preferably with a gloriously concise DSL as its interface?

I've used Ragel somewhat, but because of the preprocessing step, I'd hesitate to recommend it to someone as "use this instead of some hairy regex". It's awkward to use from Obj-C, and I expect it will be terribly awkward from a language that doesn't have compile-link-run as part of its standard operating procedure.

What I'm looking for is something that will pass the "inline-online-universal" test.

  1. (inline) You can write the notation inline with your other code, as you would with a regex.

  2. (online) You can run the resulting parser just as you would your other code, which would mean right after input to a REPL in the case of something like Python.

  3. (universal) You can move to a different language/platform and use virtually the same code for your parser, modulo dialect differences. In reality, I'd be happy with something that works from Python, Ruby, C, Java, and Haskell.
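
To make the first two criteria concrete, here is a minimal sketch in Python, assuming nothing beyond the standard library. The helper names (`lit`, `seq`, `many`, `fmap`, `lazy`, `max_depth`) are made up for illustration and do not come from any existing library. The grammar recognizes balanced parentheses and reports the nesting depth, which a classic regex cannot do. It says nothing about criterion 3, which is the hard part.

```python
# A parser is a function (text, pos) -> (value, new_pos), or None on failure.
# All names here are hypothetical helpers, not an existing library's API.

def lit(s):
    """Match the literal string s."""
    def p(text, i):
        return (s, i + len(s)) if text.startswith(s, i) else None
    return p

def seq(*parsers):
    """Match each parser in order; the value is the list of their values."""
    def p(text, i):
        out = []
        for q in parsers:
            r = q(text, i)
            if r is None:
                return None
            v, i = r
            out.append(v)
        return out, i
    return p

def many(q):
    """Match q zero or more times; the value is the list of matches."""
    def p(text, i):
        out = []
        while True:
            r = q(text, i)
            if r is None or r[1] == i:  # stop on failure or no progress
                return out, i
            v, i = r
            out.append(v)
    return p

def fmap(f, q):
    """Apply f to the value produced by q."""
    def p(text, i):
        r = q(text, i)
        if r is None:
            return None
        v, j = r
        return f(v), j
    return p

def lazy(thunk):
    """Defer lookup of a parser so that a rule can refer to itself."""
    return lambda text, i: thunk()(text, i)

# group := "(" group* ")"   (value: maximum nesting depth)
group = fmap(
    lambda v: 1 + max(v[1], default=0),
    seq(lit("("), many(lazy(lambda: group)), lit(")")),
)

def max_depth(text):
    """Return the nesting depth if text is one balanced group, else None."""
    r = group(text, 0)
    if r is None or r[1] != len(text):
        return None
    return r[0]
```

Pasted into a REPL, this is usable immediately: `max_depth("(()(()))")` returns `3`, and `max_depth("(()")` returns `None`. The grammar is written inline as ordinary expressions and there is no separate generation step.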

Most tools I know of fall down at "online". They preprocess a grammar offline and spit out code in the target language (C, Python, Java, C++…). They're standalone tools that aren't themselves integrated into the language environment.

I've had suggestions of PEG parsers and lex/yacc combos. Parser combinator libraries might also be a good fit. Whatever you propose, your answer should demonstrate that it meets the inline-online-universal requirements by providing a working demo parser in Python, C, and Haskell. The demo example is up to the author, but it should be something painful with regexen alone yet trivial with a proper parser.

Jeremy W. Sherman
  • 1
    Similar to, but differs in focus from, http://stackoverflow.com/questions/803515/why-do-on-line-parsers-seem-to-stop-at-regexps. – Jeremy W. Sherman Sep 18 '12 at 20:01
  • 5
    Regex *is* the gloriously concise DSL. – Jay Sep 18 '12 at 20:16
  • 5
    For those like me who had no idea, "regexen" is apparently a plural form of "regex," in addition to "regexes." – Andrew Cheong Sep 18 '12 at 20:31
  • 3
    The reason why there are so many of those responses is that generally there already is a parser for the specific language included. This is especially true for anything XML-based (and older HTML, of course). Anything else will probably require BNF or something similar. – Maarten Bodewes Sep 18 '12 at 20:37
  • 2
    @acheong87 It's an entrenched linguistic gag at this point. I didn't realize it'd make this hard to read for some folks. Programmerese as I speak it admits -xen to form the plural of nouns ending in -x. Ox -> oxen, vax -> vaxen, unix -> unixen, regex -> regexen. I even use Publixen (Publix is a supermarket in the Southeastern US) at times. – Jeremy W. Sherman Sep 18 '12 at 20:50
  • 1
    Most of the PEG parser generators implement pretty much the same syntax, so it should be enough for a first approximation. – SK-logic Oct 01 '12 at 12:07
  • I agree with Jay. Once you've read (and studied) [Mastering Regular Expressions (3rd Edition)](http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124 "By Jeffrey Friedl. Best book on Regex - ever!") and can _think in regex_, you won't need anything more! – ridgerunner Oct 16 '12 at 16:26
  • I doubt such a thing exists. Even if it does, there will not be the support community or tooling that there is for regexen (e.g. [Expresso](http://www.ultrapico.com/Expresso.htm)). – Ryan Gates Oct 17 '12 at 17:31
  • @RyanGates The support and tooling exists for some two-step parser dev systems, like [ANTLRWorks](http://www.antlr.org/works/index.html) for [ANTLR](http://www.antlr.org). Besides, I'm not looking for community, just one thing usable now, without a separate build-and-run step, across several platforms. – Jeremy W. Sherman Oct 17 '12 at 18:57
  • Heh, I would want $250,000 for doing this, not +250 reputation. – Tyler Durden Oct 19 '12 at 19:05
  • The [/r/haskell Reddit discussion](http://www.reddit.com/r/haskell/comments/11m1xo/portability_of_parser_combinator_approach/) of the possibility of using parser combinators makes a good supplement to this question. – Jeremy W. Sherman Nov 30 '12 at 22:39

2 Answers


https://github.com/leblancmeneses/NPEG

Implements PEG.

Meets all 3 ... let me explain.

It is inline only with C# and offline with all the others. C# has an offline version also.

I currently support offline versions for C/C++, JavaScript (local right now), and Java. All of them pass the unit tests, which is what makes it universal. Adding another language takes about 25.84 hours (how long it took to create the offline JavaScript version).

Making it online for every language is possible but would be too much maintenance; it took me a lot of work and time just to support the current offline versions. I can now focus my energy on building grammar optimizers and tooling to unit test grammar rules, which benefits all the offline versions.

Leblanc Meneses
  • Looks like we have a winner: PEGs and packrat parsers. The README provides examples of parsing the standard expression/product/sum grammar in several languages (C/C++/C#/Java), including an inline parser in C#. – Jeremy W. Sherman Oct 19 '12 at 20:15
  • http://www.robusthaven.com/blog/parsing-expression-grammar/npeg-language-workbench – Leblanc Meneses Oct 20 '12 at 06:50

Have a look at Lex/Yacc or their counterparts Flex/Bison (or Coco, or all the other "compiler" generators). The combination can be used to parse complex textual data with an (arguably) much more readable syntax than with regexen.

For simple problems, though, where regexen are more than sufficient, by all means use them.

  • I mentioned Ragel in my question, and am no stranger to lex/yacc/ANTLR/Lemon/etc. These all fail the "inline-online-everywhere" requirement: (1) You can write the notation inline with your other code, as you would with a regex, (2) you can run it just as you would your other code, which would mean right after input to a REPL in the case of something like Python, and (3) you can switch to a completely different language/platform and use virtually the same code for your parser, modulo dialect differences. I'll extend the question to clarify this. – Jeremy W. Sherman Oct 13 '12 at 00:29