I would like to use Lark to generate a standalone parser for a small rational language of mine. This needs to be a LALR(1) parser.
It should accept the following input:
(lorem) consectetur adipiscing elit
(lorem) [ipsum dolor sit amet] consectetur adipiscing elit
My best guess for the grammar (note: I am a complete beginner in parsing):
start : (blank_lines | line)*
blank_lines : /^([ \t]*\n)+/m
line : "(" head ")" ("[" option "]")? tail "\n"
head : /\w+/
option : TEXT
tail: TEXT
TEXT : /[^\[\]\n]+/
%ignore /[ \t]+/
This works with Lark's Earley parser, but fails with LALR(1) (you can test that on https://www.lark-parser.org/ide/).
More precisely, LALR(1) accepts the first lorem-line, but fails on the second one with:
(lorem) [ipsum dolor sit amet] consectetur adipi
^
Expected one of:
* NEWLINE
Previous tokens: Token('TEXT', ' ')
(Obviously, if I remove the ? in the definition of line, it fails on the first one and succeeds on the second one.)
Ok, let's replace the definition of TEXT by:
TEXT : /[^ \[][^\[\]\n]*/
Now it gives the expected result, both with LALR(1) and Earley. I thought specifying %ignore /[ \t]+/ would have made this workaround unnecessary.
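To illustrate what seems to be happening, here is a minimal sketch of a regex-based lexer written with Python's re module only. It is a simplified model and not Lark's actual implementation: the tokenize function, the terminal lists and the name WS for the ignored terminal are all hypothetical, and it assumes that TEXT is tried before the ignored whitespace terminal, which is consistent with the Token('TEXT', ' ') shown in the error above. It only lexes the part of the second line that follows "(lorem)".

```python
import re

# Simplified model (an assumption, not Lark's real code): at each position,
# try the terminals in order and keep the first one that matches.
# Tokens named WS are dropped after matching, mimicking %ignore.
def tokenize(text, terminals):
    compiled = [(name, re.compile(pattern)) for name, pattern in terminals]
    pos, tokens = 0, []
    while pos < len(text):
        for name, regex in compiled:
            match = regex.match(text, pos)
            if match and match.end() > pos:
                if name != "WS":
                    tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"no terminal matches at position {pos}")
    return tokens

PUNCTUATION = [("LSQB", r"\["), ("RSQB", r"\]"), ("NEWLINE", r"\n")]

# Original TEXT: may start with a space, so it competes with WS.
ORIGINAL = PUNCTUATION + [("TEXT", r"[^\[\]\n]+"), ("WS", r"[ \t]+")]

# Modified TEXT: cannot start with a space, so WS gets to consume it.
FIXED = PUNCTUATION + [("TEXT", r"[^ \[][^\[\]\n]*"), ("WS", r"[ \t]+")]

# The part of the second input line that follows "(lorem)".
REST = " [ipsum dolor sit amet] consectetur adipiscing elit\n"

original_tokens = tokenize(REST, ORIGINAL)
fixed_tokens = tokenize(REST, FIXED)

print(original_tokens[0])  # the lone space is lexed as TEXT
print(fixed_tokens[0])     # the space is ignored, "[" comes first
```

In this model the ignored terminal is just another terminal whose matches are discarded afterwards. It does not strip whitespace before the other terminals get a chance to match. That would explain why the single space before "[" becomes a TEXT token with the original definition. Note that the first character class of the modified TEXT still admits a tab.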
Is there a better way to write this grammar?