
Everything I've read suggests Treetop backtracks like regular expressions, but I'm having a hard time making that work.

Suppose I have the following grammar:

grammar TestGrammar
  rule open_close
    '{' .+ '}'
  end
end

This does not match the string {abc}. I suspect that's because the .+ is consuming everything from the letter a onwards. I.e. it's consuming abc} when I only want it to consume abc.

This appears different from what a similar regex does. The regex /{.+}/ will match {abc}. It's my understanding that this is possible because the regex engine backtracks after consuming the closing } as part of the .+ and then failing to match.

So can Treetop do backtracking like that? If so, how?

I know you can use negation to match "anything other than a }." But that's not my intention. Suppose I want to be able to match the string {ab}c}. The tokens I want in that case are the opening {, a middle string of ab}c, and the closing }. This is a contrived example, but it becomes very relevant when working with nested expressions like {a b {c d}}.

rlkw1024

1 Answer


Treetop is an implementation of a Parsing Expression Grammar parser. One of the benefits of PEGs is their combination of flexibility, speed, and memory requirements. However, this balancing act has some tradeoffs.

Quoting from the Wikipedia article:

The zero-or-more, one-or-more, and optional operators consume zero or more, one or more, or zero or one consecutive repetitions of their sub-expression e, respectively. Unlike in context-free grammars and regular expressions, however, these operators always behave greedily, consuming as much input as possible and never backtracking. […] the expression (a* a) will always fail because the first part (a*) will never leave any a's for the second part to match.

(Emphasis mine.)

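In short: a PEG's ordered choice (/) can backtrack to try the next alternative, but the repetition operators (*, + and ?) never give back input they have already consumed. That is why .+ swallows the closing } and leaves nothing for '}' to match.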
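To see the difference without Treetop, here is a minimal Ruby sketch that contrasts an ordinary backtracking regex with one whose repetition is wrapped in an atomic group. The atomic group only approximates PEG repetition (it is a regex feature, not part of Treetop), and the constant and method names are made up for illustration.

```ruby
# Ordinary regex: .+ first swallows "abc}", then backtracks one character
# so that the trailing \} can match.
BACKTRACKING = /\A\{.+\}\z/

# Atomic group (?>...): once .+ has consumed "abc}", nothing is given back.
# This approximates (it is not identical to) how a PEG repetition behaves.
PEG_LIKE = /\A\{(?>.+)\}\z/

# Wikipedia's (a* a) example, again approximated with an atomic group.
STAR_THEN_A = /\A(?>a*)a\z/

# Hypothetical helper: true when the pattern matches the whole input.
def matches?(pattern, input)
  !pattern.match(input).nil?
end

puts matches?(BACKTRACKING, "{abc}")  # true
puts matches?(PEG_LIKE, "{abc}")      # false
puts matches?(STAR_THEN_A, "aaa")     # false
```

The second and third patterns fail for the same reason the Treetop rule does: the greedy repetition has already eaten the character that the next element needs.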

Instead, to match nested sub-expressions, make the contents an ordered choice that tries the delimited sub-expression first and falls back to the non-bracket characters. Something like (untested):

grammar TestGrammar
  rule open_close
    '{' contents '}'
  end
  rule contents
    # repeat so that text and nested groups can be mixed, as in {a b {c d}}
    (open_close / non_brackets)*
  end
  rule non_brackets
    # …
  end
end
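For readers who want to watch the recursion run without installing Treetop, the following is a rough hand-written equivalent in plain Ruby. It is a sketch under two assumptions that the grammar above leaves open: that non_brackets means "one or more characters other than { and }", and that the contents may repeat. The class name BraceParser is hypothetical.

```ruby
require 'strscan'

# Hypothetical hand-written counterpart of the grammar sketch.
# Assumed, not stated in the answer: non_brackets is [^{}]+ and the
# contents of a group may repeat zero or more times.
class BraceParser
  def initialize(input)
    @scanner = StringScanner.new(input)
  end

  # Parses the whole input as one open_close group.
  # Returns a nested array, or nil if the input does not fit.
  def parse
    tree = open_close
    tree if tree && @scanner.eos?
  end

  private

  # open_close <- '{' (open_close / non_brackets)* '}'
  def open_close
    start = @scanner.pos
    return nil unless @scanner.scan(/\{/)

    children = []
    # Ordered choice: try a nested group first, then plain text.
    while (child = open_close || non_brackets)
      children << child
    end

    return children if @scanner.scan(/\}/)

    @scanner.pos = start # rewind so the caller can try another alternative
    nil
  end

  # non_brackets <- [^{}]+
  def non_brackets
    @scanner.scan(/[^{}]+/)
  end
end

p BraceParser.new("{a b {c d}}").parse  # ["a b ", ["c d"]]
p BraceParser.new("{ab}c}").parse       # nil
```

The second call returns nil because the group closes at the first }, leaving c} unparsed; under these assumptions a closing brace can never be part of the plain text.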
Phrogz
  • Thanks! How would you achieve what I'm describing with a PEG? – rlkw1024 Oct 17 '12 at 05:09
  • @Jarrett See my edit. Forgive what may be horribly broken syntax; it's been years since I last used Treetop. Note that the Wikipedia entry discusses this specifically (search for "Pascal-style nested comment syntax" on the page). – Phrogz Oct 17 '12 at 05:12
  • Awesome! Thanks! What about matching this string: {ab}c} – rlkw1024 Oct 17 '12 at 05:27
  • Got it. I decided that I just have to regard the first } in {ab}c}, rather than the final one, as the closing bracket. Then, the string c} can be regarded as plain old text. Not exactly what I wanted, but it's a reasonable compromise. I'm not sure, but I'm starting to think this may follow necessarily from the nature of PEGs. – rlkw1024 Oct 17 '12 at 06:06