I have a simple grammar that parses key-value pairs section by section.
k1:1
k2:x

k3:3
k4:4
The grammar I have for it is:
start: section (_sep section)*
_sep: _NEWLINE _NEWLINE+
section: item (_NEWLINE item)*
item: NAME ":" VALUE
_NEWLINE: /\r?\n[\t ]*/
VALUE: /\w+/
NAME: /\w+/
However, the grammar works with the Earley parser but not with the LALR parser. I tested it with the following code:
import logging
from pathlib import Path

from lark import Lark

logging.basicConfig(level=logging.DEBUG)

my_grammar = Path("my_grammar.lark").read_text()
print(my_grammar)

# Earley is Lark's default parser; LALR has to be requested explicitly.
earley = Lark(my_grammar, debug=True)
lalr = Lark(my_grammar, parser='lalr', debug=True)

# Two sections separated by one blank line.
text = """
k1:1
k2:x

k3:3
k4:4
"""
print(text.strip())
print(earley.parse(text.strip()).pretty())
print(lalr.parse(text.strip()).pretty())
The Earley parser gives me the expected result:
start
  section
    item
      k1
      1
    item
      k2
      x
  section
    item
      k3
      3
    item
      k4
      4
but the LALR parser does not:
lark.exceptions.UnexpectedCharacters: No terminal defined for '
' at line 3 col 1
^
Expecting: {'NAME'}
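PS: The problem is with _NEWLINE.
A Lark grammar configures both the lexer and the parser in one grammar file. In my grammar above, each line break is tokenized as one _NEWLINE, so a blank line between sections arrives as _NEWLINE _NEWLINE. This confuses the LALR(1) parser: after an item it has only a single _NEWLINE of lookahead, which could either continue the current section or begin a _sep, and it ends up expecting a NAME where the second _NEWLINE appears.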
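To make that token stream concrete, here is a minimal sketch that mimics the lexer with Python's re module rather than with Lark itself. The tokenize helper and the TERMINALS table are hypothetical names for illustration; only the regexes come from the grammar above, and since NAME and VALUE share one pattern the sketch labels both WORD.

```python
import re

# Terminals copied from the grammar. WORD stands in for NAME/VALUE, which
# share the same pattern; all helper names are illustrative assumptions.
TERMINALS = [
    ("_NEWLINE", r"\r?\n[\t ]*"),
    ("COLON", r":"),
    ("WORD", r"\w+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TERMINALS))


def tokenize(text):
    """Return the terminal names matched from left to right."""
    kinds = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"no terminal matches at offset {pos}")
        kinds.append(match.lastgroup)
        pos = match.end()
    return kinds


# The end of one section, a blank line, and the start of the next one.
kinds = tokenize("k2:x\n\nk3:3")
print(kinds)
# ['WORD', 'COLON', 'WORD', '_NEWLINE', '_NEWLINE', 'WORD', 'COLON', 'WORD']
```

The two consecutive _NEWLINE tokens are the spot where a parser with one token of lookahead has to guess.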
Changing _sep to the single pattern
/\r?\n[\t ]*(\r?\n[\t ]*)+/
fixes it: a run of two or more line breaks is tokenized as one token, and the LALR(1) parser handles it smoothly.
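As a rough illustration of why one separator token is enough, the sketch below again uses plain re instead of Lark. It assumes the combined separator pattern is tried before _NEWLINE, and it uses a trailing + on the repeated part so that several blank lines still collapse into one separator, mirroring the original _NEWLINE _NEWLINE+. The names tokenize and split_sections are hypothetical.

```python
import re

# Assumption: _SEP is listed first so a run of line breaks is claimed as a
# single separator before _NEWLINE gets a chance. (?:...) keeps the inner
# group non-capturing so match.lastgroup reports the terminal name.
TERMINALS = [
    ("_SEP", r"\r?\n[\t ]*(?:\r?\n[\t ]*)+"),
    ("_NEWLINE", r"\r?\n[\t ]*"),
    ("COLON", r":"),
    ("WORD", r"\w+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TERMINALS))


def tokenize(text):
    """Yield (terminal name, matched text) pairs from left to right."""
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"no terminal matches at offset {pos}")
        yield match.lastgroup, match.group()
        pos = match.end()


def split_sections(text):
    """Group key-value pairs into sections, looking at one token at a time."""
    sections, pairs, words = [], [], []
    for kind, value in tokenize(text):
        if kind == "WORD":
            words.append(value)
        elif kind in ("_NEWLINE", "_SEP"):
            pairs.append(tuple(words))  # the current item is complete
            words = []
            if kind == "_SEP":          # the current section is complete too
                sections.append(pairs)
                pairs = []
    pairs.append(tuple(words))
    sections.append(pairs)
    return sections


sections = split_sections("k1:1\nk2:x\n\n\nk3:3\nk4:4")
print(sections)
# [[('k1', '1'), ('k2', 'x')], [('k3', '3'), ('k4', '4')]]
```

Each token now says unambiguously whether the section continues (_NEWLINE) or ends (_SEP), so no decision depends on what follows it.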
While I got it working, I am still curious how the Earley parser gets it right.