
Consider this simple test of the Python Lark parser:

from lark import Lark

GRAMMAR = '''
start: container*

container: string ":" "{" (container | attribute | attribute_value)* "}"
attribute: attribute_name "=" (attribute_value | container)
attribute_value: string ":" _value ("," _value)*
_value: number | string

attribute_name: /[A-Za-z_][A-Za-z_#0-9]*/

string: /[A-Za-z_#0-9]+/
number: /[0-9]+/

%import common.WS
%ignore WS
'''

data = '''outer : {
 inner : {
 }
}'''

parser = Lark(GRAMMAR, parser='lalr')
parser.parse(data)

This works with parser='earley' but it fails with parser='lalr'. I don't understand why. The error message is:

UnexpectedCharacters: No terminal defined for '{' at line 2 col 12

inner : {

This is just an MWE. My actual grammar suffers from the same problem.

C. E.

I'm having a similar problem. What I have not found out so far is how to troubleshoot something like this. Is there a way to get Lark to tell you the "path" it took until encountering UnexpectedCharacters? That way, you could maybe find out where the parser makes a bad turn or where the grammar is badly formed. – Christoph May 24 '19 at 11:56

1 Answer


This fails with LALR because it has a lookahead of only 1 (unlike Earley, which has unlimited lookahead), so it gets confused between attribute_name and string, whose patterns match the same input. Once it matches one or the other (in this case, attribute_name), it cannot backtrack and match a different rule.
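To make the overlap concrete, here is a small check that uses only Python's `re` module with the two patterns copied from the grammar in the question. It is a sketch of the overlap only, not of how Lark's lexer works internally. The helper name `matching_patterns` and the sample strings other than "inner" and "outer" are made up for illustration.

```python
import re

# Patterns copied from the grammar in the question.
ATTRIBUTE_NAME = re.compile(r"[A-Za-z_][A-Za-z_#0-9]*")
STRING = re.compile(r"[A-Za-z_#0-9]+")


def matching_patterns(text):
    """Return the names of the patterns that match the whole of `text`."""
    names = []
    if ATTRIBUTE_NAME.fullmatch(text):
        names.append("attribute_name")
    if STRING.fullmatch(text):
        names.append("string")
    return names


# "9lives" and "#tag" are illustrative inputs, not taken from the question.
for sample in ["inner", "outer", "9lives", "#tag"]:
    print(sample, "->", matching_patterns(sample))
```

Every input that attribute_name accepts is also accepted by string, so for a token such as "inner" the lexer has to pick one of the two without knowing which the parser will need.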

If you use a lower priority for the attribute_name terminal, it will work. For example:

attribute_name: ATTR

ATTR.0: /[A-Za-z_][A-Za-z_#0-9]*/

However, the recommended practice is to use the same terminal for both where possible, so that the parser, rather than the lexer, does the thinking for you. If you need extra validation, you can add it after parsing is done.
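A post-parse check of that kind might look like the sketch below. It assumes the attribute names have already been collected from the parse tree into a plain list (that step is omitted here), and the function name `invalid_attribute_names` is hypothetical. The pattern is the original attribute_name regex from the question.

```python
import re

# The stricter pattern that attribute_name used before the terminals were merged.
ATTRIBUTE_NAME = re.compile(r"[A-Za-z_][A-Za-z_#0-9]*")


def invalid_attribute_names(names):
    """Return the names that do not fully match the attribute_name pattern."""
    return [name for name in names if not ATTRIBUTE_NAME.fullmatch(name)]


# Hypothetical names, as if gathered from a parse tree after parsing.
collected = ["width", "_id", "2nd", "#count"]
bad = invalid_attribute_names(collected)
if bad:
    print("invalid attribute names:", bad)
```

Keeping the check outside the grammar means the lexer sees only one terminal for both rules, while names that the shared terminal accepts too loosely are still rejected.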

Both approaches (changing priority or merging the terminals) will solve your problem.

Erez