
Edit: I wrote a first version, which Eike helped me improve quite a bit. I'm now stuck on a more specific problem, which I describe below. You can have a look at the original question in the history.


I'm using pyparsing to parse a small language used to request specific data from a database. It features numerous keywords, operators and datatypes, as well as boolean logic.

I'm trying to improve the error message shown to users when they make a syntax error, since the current one is not very useful. I designed a small example, similar to what I'm doing with the language mentioned above but much smaller:

#!/usr/bin/env python                            

from pyparsing import *

def validate_number(s, loc, tokens):
    if int(tokens[0]) != 0:
        raise ParseFatalException(s, loc, "number must be 0")

def fail(s, loc, tokens):
    raise ParseFatalException(s, loc, "Unknown token %s" % tokens[0])

def fail_value(s, loc, expr, err):
    raise ParseFatalException(s, loc, "Wrong value")

number =  Word(nums).setParseAction(validate_number).setFailAction(fail_value)
operator = Literal("=")

error = Word(alphas).setParseAction(fail)
rules = MatchFirst([
    Literal('x') + operator + number,
])

rules = operatorPrecedence(rules | error , [
    (Literal("and"), 2, opAssoc.RIGHT),
])

def try_parse(expression):
    try:
        rules.parseString(expression, parseAll=True)
    except Exception as e:
        msg = str(e)
        print("%s: %s" % (msg, expression))
        print(" " * (len("%s: " % msg) + (e.loc)) + "^^^")

So basically, the only thing we can do with this language is write a series of `x = 0` comparisons, joined together with `and` and parentheses.

Now, there are cases where `and` and parentheses are used and the error reporting is not very good. Consider the following examples:

>>> try_parse("x = a and x = 0") # This one is actually good!
Wrong value (at char 4), (line:1, col:5): x = a and x = 0
                                              ^^^
>>> try_parse("x = 0 and x = a")
Expected end of text (at char 6), (line:1, col:1): x = 0 and x = a
                                                         ^^^
>>> try_parse("x = 0 and (x = 0 and (x = 0 and (x = a)))")
Expected end of text (at char 6), (line:1, col:1): x = 0 and (x = 0 and (x = 0 and (x = a)))
                                                         ^^^
>>> try_parse("x = 0 and (x = 0 and (x = 0 and (xxxxxxxx = 0)))")
Expected end of text (at char 6), (line:1, col:1): x = 0 and (x = 0 and (x = 0 and (xxxxxxxx = 0)))
                                                         ^^^

Actually, it seems that if the parser can't parse (and "parse" here is important) something after an `and`, it doesn't produce good error messages anymore :(

And I do mean parse: if it can parse `5` but the "validation" fails in the parse action, it still produces a good error message. But if it can't parse a valid number (like `a`) or a valid keyword (like `xxxxxxxx`), it stops producing the right error messages.

Any idea?

Jonathan Ballet
  • Have a validating parse action for the variable names too. Or have a catch all variable name like `Word(alphas)`, and put a parse action on it, that always raises an exception. – Eike Apr 09 '13 at 12:38
  • Alternatively you could do the validation one level up. Have a parser `Word(alphas) - "==" - Word(nums)` and put a more complex parse action on it, that looks for legal variable names, and ensures the correctness of the numbers. – Eike Apr 09 '13 at 14:02
  • At the moment, that would be a last resort solution :) – Jonathan Ballet Apr 09 '13 at 14:09
  • The reason is simple: The parser backtracks: `"a = 0"` is a complete program. The parser happens to test this hypothesis last, and it also fails because there is more text after `"a = 0"`. This is the reason why the parser expects "end of text". It is basically an implementation detail of `operatorPrecedence`. Insert `-` wherever possible to prevent backtracking, this will make the error messages at least slightly better. – Eike Apr 09 '13 at 17:39
  • I replaced the "+" with "-" in the rules definition in ``MatchFirst()``, but it doesn't change anything :( I don't see any other place where it could be used :( – Jonathan Ballet Apr 09 '13 at 20:32
  • Improved version of your script: http://pastebin.com/7E4kSnkm However, I don't fully understand it. There are two issues: **1.** There seems to be a bug in handling `ParseFatalException`; it is not always fatal. **2.** The `-` operator seems not to work. -- After a bit of additional playing, I think you should ask Pyparsing's author Paul McGuire, who writes very good explanations when he has time. – Eike Apr 09 '13 at 21:56

1 Answer


Pyparsing will always have somewhat bad error messages, because it backtracks. The error message is generated by the last rule that the parser tries. The parser can't know where the error really is; it only knows that no rule matches.

For good error messages you need a parser that gives up early. These parsers are less flexible than Pyparsing, but most conventional programming languages can be parsed with such parsers. (C++ and Scala IMHO can't.)

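To improve error messages in Pyparsing, use the - operator. It works like the + operator, but it does not backtrack: once the elements before the - have matched, a failure in a later element is raised as a fatal error instead of letting the parser try another alternative. You would use it like this: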

assignment = Literal("let") - varname - "=" - expression
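To make the difference concrete, here is a minimal sketch that parses the same faulty input with a `+` rule and a `-` rule. The names `let_plus`, `let_minus`, `bare_word` and `error_position` are made up for this illustration, and the grammar is a toy one, not the question's. It uses the camelCase `parseString` to match the question's code.

```python
import pyparsing as pp

varname = pp.Word(pp.alphas)
number = pp.Word(pp.nums)

# The same rule twice: "+" allows backtracking, "-" forbids it after "let".
let_plus = pp.Keyword("let") + varname + "=" + number
let_minus = pp.Keyword("let") - varname - "=" - number

# A second alternative (hypothetical) that also happens to accept the word "let".
bare_word = pp.Word(pp.alphas)


def error_position(parser, text):
    """Return (exception class name, loc) for a failed parse, or None on success."""
    try:
        parser.parseString(text, parseAll=True)
    except pp.ParseBaseException as exc:
        return type(exc).__name__, exc.loc
    return None


text = "let x = y"  # the real mistake is "y", at index 8

# With "+", the parser backtracks into bare_word and then complains
# about leftover text near the start of the input.
plus_error = error_position(let_plus | bare_word, text)

# With "-", the failure at "y" is fatal and is reported where it happened.
minus_error = error_position(let_minus | bare_word, text)

print(plus_error)
print(minus_error)
```

In this sketch the `-` version reports a `ParseSyntaxException` located at the offending `y`, while the `+` version reports an ordinary `ParseException` much earlier in the string. Whether `-` helps in a grammar built with `operatorPrecedence` is a separate question, as the comments on the question show.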

Here is a small article on improving error reporting, by Pyparsing's author.

Edit

You could also generate good error messages for the invalid numbers in the parse actions that do the validation. If the number is invalid you raise an exception that is not caught by Pyparsing. This exception can contain a good error message.

Parse actions can have three arguments [1]:

  • s = the original string being parsed (see note below)
  • loc = the location of the matching substring
  • toks = a list of the matched tokens, packaged as a ParseResults object

There are also three useful helper methods for creating good error messages [2]:

  • lineno(loc, string) - function to give the line number of the location within the string; the first line is line 1, newlines start new rows.
  • col(loc, string) - function to give the column number of the location within the string; the first column is column 1, newlines reset the column number to 1.
  • line(loc, string) - function to retrieve the line of text representing lineno(loc, string). Useful when printing out diagnostic messages for exceptions.
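As a quick illustration of these three helpers, the sketch below calls them on a two-line string. The input text and the `describe` and `pointer` functions are invented for the example; only `lineno`, `col` and `line` come from Pyparsing.

```python
from pyparsing import col, line, lineno

text = "x = 0\nand y = 1"
loc = text.index("y")  # pretend a parse action was called at this offset


def describe(loc, text):
    """Build a human-readable position from a raw character offset."""
    return "line {0}, column {1}".format(lineno(loc, text), col(loc, text))


def pointer(loc, text):
    """Show the offending line with a caret under the reported column."""
    return line(loc, text) + "\n" + " " * (col(loc, text) - 1) + "^"


print(describe(loc, text))
print(pointer(loc, text))
```

Both line and column numbers are 1-based, which is why the caret is indented by `col - 1` spaces.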

Your validating parse action would then be like this:

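from pyparsing import lineno, col

class MyFatalParseException(Exception):
    """Custom error, not derived from Pyparsing's exceptions, so Pyparsing does not catch it."""

def validate_odd_number(s, loc, toks):
    # Reject even numbers and report where the offending token starts.
    value = int(toks[0])
    if value % 2 == 0:
        raise MyFatalParseException(
            "not an odd number. Line {l}, column {c}.".format(l=lineno(loc, s),
                                                              c=col(loc, s)))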
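Here is a hedged sketch of how the caller's side could look when the parse action raises its own exception type. `OddNumberError`, `check` and the tiny `name == number` grammar are assumptions made for this example, not the question's real grammar. The point is only that an exception which does not derive from Pyparsing's exception classes passes through the parser untouched and can carry its own location.

```python
import pyparsing as pp


class OddNumberError(Exception):
    """Hypothetical error type; deliberately not a pyparsing exception."""

    def __init__(self, message, text, loc):
        super().__init__(message)
        self.text = text  # the string being parsed
        self.loc = loc    # offset of the offending token


def validate_odd_number(s, loc, toks):
    # Raise our own exception so that pyparsing does not backtrack over it.
    if int(toks[0]) % 2 == 0:
        raise OddNumberError("not an odd number", s, loc)


number = pp.Word(pp.nums).setParseAction(validate_odd_number)
comparison = pp.Word(pp.alphas) + "==" + number
expression = comparison + pp.ZeroOrMore(pp.Keyword("and") + comparison)


def check(text):
    """Parse text and return a short status message."""
    try:
        expression.parseString(text, parseAll=True)
    except OddNumberError as exc:
        return "{0} (line {1}, col {2})".format(
            exc, pp.lineno(exc.loc, exc.text), pp.col(exc.loc, exc.text))
    except pp.ParseBaseException as exc:
        return "syntax error: {0}".format(exc)
    return "ok"


print(check("x == 1 and y == 2"))
print(check("x == 1 and y == 3"))
```

Catching the custom exception separately from `ParseBaseException` keeps validation errors, which have a reliable position, apart from ordinary syntax errors, whose position may still be affected by backtracking.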

[1] http://pythonhosted.org/pyparsing/pyparsing.pyparsing.ParserElement-class.html#setParseAction

[2] HowToUsePyparsing

Edit

Here [3] is an improved version of the question's current (2013-4-10) script. It gets the example errors right, but other errors are indicated at the wrong position. I believe there are bugs in my version of Pyparsing ('1.5.7'), but maybe I just don't understand how Pyparsing works. The issues are:

  • ParseFatalException seems not to be always fatal. The script works as expected when I use my own exception.
  • The - operator seems not to work.

[3] http://pastebin.com/7E4kSnkm

Chris Connett
Eike
  • It only slightly helps: not an even number (at char 0), (line:1, col:1): x == 1 and y == 1 (whereas the error is on "y") – Jonathan Ballet Apr 09 '13 at 09:07
  • Yes, it's a tricky area, I'm struggling to get good error messages too. – Eike Apr 09 '13 at 09:31
  • One problem IMHO is `operatorPrecedence` it rewrites `rules` and returns a complicated parser, that can really parse the expression. The quality of the error messages mainly depends on the implementation of `operatorPrecedence` and less on your code. – Eike Apr 09 '13 at 09:42
  • Basically you have to design your language, to get good error reporting. For good error messages it helps to have a really silly language of the kind: `var a as Int; let a = 2;` – Eike Apr 09 '13 at 09:45
  • My language is super simple in essence: 7 comparison operators (=, !=, <, <=, >, >=, and in), a few boolean operators (not, and, or, xor), and then it's just "KEYWORD OP VALUE" combined together. So this is really super simple. Values can be a bit tricky, but I have a whole validation framework that I plugged in, and it works great, except when the values aren't good, where I get the kind of errors in my OP :/ – Jonathan Ballet Apr 09 '13 at 09:58
  • If I can get good errors on my original example, that would already be great. I saw operatorPrecedence was doing a whole ton of stuff; I'm not sure I can get better error messages with something else. – Jonathan Ballet Apr 09 '13 at 10:00
  • Raising a custom exception like you did would do the trick it seems... Good idea! – Jonathan Ballet Apr 09 '13 at 12:16