
I would like to use Lark to generate a standalone parser for a small rational language of mine. This needs to be a LALR(1) parser.

It should accept the following input:

(lorem) consectetur adipiscing elit

(lorem) [ipsum dolor sit amet] consectetur adipiscing elit

My best guess for the grammar (note: I am a complete beginner in parsing):

start : (blank_lines | line)*
blank_lines : /^([ \t]*\n)+/m
line : "(" head ")" ("[" option "]")? tail "\n"
head : /\w+/
option : TEXT
tail: TEXT
TEXT : /[^\[\]\n]+/

%ignore /[ \t]+/

This works with Lark's Earley parser, but fails with LALR(1) (you can test that on https://www.lark-parser.org/ide/).

More precisely, LALR(1) accepts the first lorem-line, but fails on the second one with:

(lorem) [ipsum dolor sit amet] consectetur adipi
        ^
Expected one of: 
    * NEWLINE

Previous tokens: Token('TEXT', ' ')

(Obviously, if I suppress the ? in the definition of line, it fails on the first one and succeeds on the second one.)

Ok, let's replace the definition of TEXT by:

TEXT : /[^ \[][^\[\]\n]*/

Now it gives the expected result with both LALR(1) and Earley. I thought that specifying %ignore /[ \t]+/ would have made this change unnecessary.

Is there a better way to write this grammar?

Aristide

2 Answers

You have an ambiguity between the TEXT terminal and the %ignore terminal: both can match the space that follows ")". Lark does not guarantee how such a conflict is resolved. In general, however, it prefers the terminal that is not ignored, so that parsing actually makes progress.

You need to make sure this ambiguity does not exist, which is what your changed definition of TEXT does.
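
To make the overlap concrete, here is a minimal sketch that uses only Python's `re` module, not Lark itself. The three patterns are copied from the question; the variable names and the `match_at` helper are illustrative. It checks what each pattern can match at the space right after ")" on the second input line.

```python
import re

# Patterns copied from the question; the variable names are illustrative.
TEXT_ORIGINAL = re.compile(r"[^\[\]\n]+")
TEXT_FIXED = re.compile(r"[^ \[][^\[\]\n]*")
IGNORED_WS = re.compile(r"[ \t]+")

line = "(lorem) [ipsum dolor sit amet] consectetur adipiscing elit\n"
pos = line.index(")") + 1  # index of the space right after ")"


def match_at(pattern, text, start):
    """Return the text that `pattern` matches at `start`, or None."""
    m = pattern.match(text, start)
    return m.group(0) if m else None


original_text = match_at(TEXT_ORIGINAL, line, pos)
ignored_ws = match_at(IGNORED_WS, line, pos)
fixed_text = match_at(TEXT_FIXED, line, pos)

print(repr(original_text))  # the original TEXT can claim the space
print(repr(ignored_ws))     # ...and so can the ignored whitespace
print(repr(fixed_text))     # the fixed TEXT cannot start on a space
```

With the original TEXT, both patterns match the same single space, which is consistent with the `Token('TEXT', ' ')` shown in the error message. With the fixed TEXT, only the ignored whitespace matches at that position, so there is nothing left to disambiguate.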

MegaIng

Answering my own question.

For some reason, ignoring /[ \t]+/ is not equivalent to ignoring (" "|/\t/)+ (which is defined as WS_INLINE in common.lark).

Replacing the former expression with the latter in my first version was enough to make it accept the input. It also lets me write a version that I think is slightly better (no more explicit negated class):

start : (blank_lines | line)*
blank_lines : /^([ \t]*\n)+/m
line : "(" head ")" ("[" option "]")? tail _NL
head : /\w+/
option : /.+(?=\])/
tail : /(?!\[).+/

%import common.NEWLINE -> _NL
%import common.WS_INLINE
%ignore WS_INLINE
Aristide
  • Changing the length of the regexp just shifts the default priority. But you should just set an explicit priority on ignored tokens, if you want them to be matched first. Something like `MY_WS.10: WS_INLINE` and then `%ignore MY_WS` – Erez May 23 '23 at 08:24
  • Ah, it makes sense. Thanks for this tip, and of course for Lark, which I really appreciate. – Aristide May 24 '23 at 11:46