I've got a problem with an ambiguous parse in instaparse (aliased as insta below). Here's the grammar:

(require '[instaparse.core :as insta]
         '[clojure.string :as str])

(def yip-shape
  (insta/parser
   (str/join "\n"
             ["S = ( list-item | heading | text-block )*"

              ;; lists and that
              "list-item = list-level <ws> anything"
              "list-level = #' {0,3}\\*'"

              ;; headings
              "heading = heading-level <ws> ( heading-keyword <ws> )? ( heading-date <ws> )? anything <eol?>"
              "heading-level = #'#{1,6}'"
              "heading-date = <'<'> #'[\\d-:]+' <'>'>"
              "heading-keyword = 'TODO' | 'DONE'"

              "text-block = anything*"

              "anything = #'.+'"
              "<eol> = '\\r'? '\\n'"
              "<ws> = #'\\s+'"])))

The problem is with a heading like "## TODO Done." I can understand why the ambiguity exists; I'm just not sure of the best way to resolve it. For example:

(insta/parses yip-shape "## TODO Done.")

Produces:

([:S [:text-block [:anything "## TODO Done."]]] 
 [:S [:heading [:heading-level "##"] [:anything "TODO Done."]]] 
 [:S [:heading [:heading-level "##"] [:heading-keyword "TODO"] [:anything "Done."]]])

The last of which is the result I'm looking for. How best to eliminate the ambiguity and narrow the result down to the last one in that list?

Phil Jackson

2 Answers

Grammars are for parsing structured data. If you take an otherwise-reasonable grammar and throw an "any old junk" rule into it, you will get a lot of parses that involve any old junk. The way to resolve the ambiguity is to be more stringent about what qualifies in your "anything" rule, or better yet to remove it entirely and instead actually parse the stuff that goes there.
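
As a hedged sketch of what tightening the heading rule could look like: the grammar below covers headings only and swaps "anything" for a hypothetical title rule (my name, not part of the question's grammar). A negative lookahead stops title from starting with a keyword followed by whitespace, so TODO can no longer be absorbed into the title text. It assumes instaparse is on the classpath.

```clojure
(require '[instaparse.core :as insta]
         '[clojure.string :as str])

;; Headings only. `title` is a hypothetical replacement for `anything`.
(def heading-shape
  (insta/parser
   (str/join "\n"
             ["heading = heading-level <ws> ( heading-keyword <ws> )? title"
              "heading-level = #'#{1,6}'"
              "heading-keyword = 'TODO' | 'DONE'"
              ;; A title may not begin with a keyword followed by whitespace.
              "title = !( heading-keyword ws ) #'[^\\r\\n]+'"
              "<ws> = #'[ \\t]+'"])))

(insta/parses heading-shape "## TODO Done.")
;; => ([:heading [:heading-level "##"] [:heading-keyword "TODO"] [:title "Done."]])
```

Because the lookahead includes the trailing whitespace, a title such as "TODOS for next week" is still accepted as plain title text.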

amalloy

One option is to tweak the regular expression for "anything" to allow any character except #. That way it only eats text up until the next # character.

Another option is to tweak the regular expression for "anything" so that it does not allow # as the first character and does not allow a newline anywhere. You would probably also want to change text-block to (anything | eol)*. In that case "anything" eats everything up to the newline, so text-block processes text one line at a time. When you hit a line beginning with #, "anything" won't pick it up, but the other rules will.
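
A minimal sketch of this second option, under some assumptions of mine: list items and the heading keyword and date parts are left out, headings no longer consume the line ending, and each text-block is limited to one line so that S itself stays unambiguous (a multi-line text-block inside a repeating S could be split up in several ways). The var name yip-lines is hypothetical.

```clojure
(require '[instaparse.core :as insta]
         '[clojure.string :as str])

(def yip-lines
  (insta/parser
   (str/join "\n"
             ["S = ( heading | text-block | <eol> )*"
              "heading = heading-level <ws> anything"
              "heading-level = #'#{1,6}'"
              "text-block = anything"
              ;; No leading '#', and never crosses a line ending.
              "anything = #'[^#\\r\\n][^\\r\\n]*'"
              "<eol> = '\\r'? '\\n'"
              "<ws> = #'[ \\t]+'"])))

(insta/parses yip-lines "## Title\nsome text")
;; => ([:S [:heading [:heading-level "##"] [:anything "Title"]]
;;         [:text-block [:anything "some text"]]])
```

One trade-off of this version is that a heading whose text itself begins with # will no longer parse, since "anything" is shared between headings and text-blocks.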

It really depends on the behavior you want, but these are some strategies for making your description of "anything" more precise.

puzzler