
I looked through the Artima guide on parser combinators, which says that we need to append failure(msg) to our grammar rules to make error reporting meaningful for the user:

def value: Parser[Any] =
    obj | stringLit | num | "true" | "false" | failure("illegal start of value")

This breaks my understanding of the recursive mechanism used in these parsers. On the one hand, the Artima guide makes sense: if all productions fail, the parser arrives at failure("illegal start of value"), which is returned to the user. On the other hand, it stops making sense once we realize that a grammar is not a flat list of value alternatives but a tree. That is, the value parser is a node that is called when a value is expected in the input. This means that the calling parser, which is also the parent, detects the failure of value and proceeds with the sibling alternative of value. Suppose that all alternatives to value also fail. The grandparent parser will then try its own alternatives. If those fail in turn, the process unwinds upward until the start-symbol parser fails. So, what will the error message be? It seems that the last alternative of the topmost parser is the one reported as erroneous.

To figure out who is right, I created a demo in which program is the topmost (start symbol) parser:

import scala.util.parsing.combinator._

object ExprParserTest extends App with JavaTokenParsers {

    // Grammar
    val declaration = wholeNumber ~ "to" ~ wholeNumber | ident | failure("declaration not found")
    val term = wholeNumber | ident
    lazy val expr: Parser[_] = term ~ rep("+" ~ expr)
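    // "=" without surrounding spaces: whitespace is skipped before each literal, so " = " could never match
    lazy val statement: Parser[_] = ident ~ "=" ~ expr | "if" ~ expr ~ "then" ~ rep(statement) ~ "else" ~ rep(statement)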
    val program  = rep(declaration) ~ rep(statement)

    // Test
    println(parseAll(program, "1 to 2")) // OK
    println(parseAll(program, "1 to '2")) // failure, regex `-?\d+' expected but `'' found at '2
    println(parseAll(program, "abc")) // OK


}

It fails on 1 to '2 because of the extra ' tick. Yes, it seems to get stuck in the program -> declaration -> num "to" num rule and does not even try the ident and failure("declaration not found") alternatives! It does not backtrack to the statements either, for the same reason. So neither my guess nor the Artima guide seems right about what parser combinators are actually doing. I wonder: what is the real logic behind rule matching, backtracking and error reporting in parser combinators? Why does the error message suggest that no backtracking to declaration -> ident | failure(), nor to the statements, occurred? What is the point of the Artima guide suggesting to place failure() at the end if, as we see, it is either not reached or ignored, as the backtracking logic says it should be anyway?

Isn't a parser combinator just a plain dumb PEG? It behaves like a predictive parser. I expected it to be a PEG and, thus, that the start-symbol parser would return all failed branches, so I wonder why and how the actual parser manages to select the most appropriate failure.
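A smaller experiment can isolate the error-selection behaviour asked about here. The sketch below assumes the scala-parser-combinators module is on the classpath; FailureChoiceDemo, rule and errorOf are hypothetical names introduced only for illustration. In this sketch the failure that `|` reports appears to be the one that got furthest into the input, with the later alternative winning ties, so failure(...) surfaces only when no alternative consumed anything.

```scala
import scala.util.parsing.combinator._

// Hypothetical demo object; not part of the original question.
object FailureChoiceDemo extends RegexParsers {
  // Same shape as the rules above: a real alternative, then failure(msg) last.
  val rule: Parser[Any] = "a" ~ "b" | failure("custom message")

  // Returns the error message of a failed parse, or "parsed" on success.
  def errorOf(input: String): String = parseAll(rule, input) match {
    case Success(_, _)   => "parsed"
    case Failure(msg, _) => msg
    case Error(msg, _)   => msg
  }

  def main(args: Array[String]): Unit = {
    // No alternative consumes anything: the trailing failure(...) is reported.
    println(errorOf("x"))
    // "a" matched before "b" failed: the deeper literal failure is reported instead.
    println(errorOf("ac"))
    // Both literals match and the input is exhausted.
    println(errorOf("ab"))
  }
}
```

This mirrors the 1 to '2 case: the first alternative fails deeper in the input than failure("declaration not found") does, so its message is the one that is kept.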

1 Answer


Many parser combinators backtrack, but not through an 'or' block that has already succeeded. As a speed optimization, they commit to the first successful 'or' item and will not go back to try the remaining items if something later fails. So 1) try to avoid '|' as much as possible in your grammar, and 2) if using '|' is unavoidable, place the longest or least-likely-to-match items first.
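A minimal sketch of the ordering advice, assuming the scala-parser-combinators module from the question is available; OrderedChoiceDemo, shortFirst and longFirst are hypothetical names used only for this illustration.

```scala
import scala.util.parsing.combinator._

// Hypothetical demo object illustrating ordered choice with '|'.
object OrderedChoiceDemo extends RegexParsers {
  // Shorter item first: "a" succeeds, so "ab" is never tried,
  // even though the following "c" then fails on the input "abc".
  val shortFirst: Parser[Any] = ("a" | "ab") ~ "c"

  // Longer item first: "ab" is tried before "a".
  val longFirst: Parser[Any] = ("ab" | "a") ~ "c"

  def main(args: Array[String]): Unit = {
    println(parseAll(shortFirst, "abc").successful) // false
    println(parseAll(longFirst, "abc").successful)  // true
    // "ab" fails here, so '|' falls through to "a", then "c" matches.
    println(parseAll(longFirst, "ac").successful)   // true
  }
}
```

The last case shows that an item which fails is still followed by the next item; only an item that has already succeeded is not revisited.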

Mark
  • You contradict yourself. Saying that parsing failed after the first alternative succeeded makes no sense. If parsing failed, then the alternative was tried but did not succeed. – Little Alien Jun 20 '16 at 23:48
  • @LittleAlien: if a subsequent item fails, you can't backtrack through the successful 'or' to try other paths. – Mark Jun 28 '16 at 04:25
  • I want this to be more pronounced. Do you mean that if `a` is taken from `(a ~ b ~ c) | d | e`, then the first branch is taken irreversibly? – Little Alien Jun 28 '16 at 22:44