
I would like to extract only certain structured patterns from a text file.

For example, in the text below:

   blablabla 
   foo FUNC1 ; blabliblo blu

I would like to isolate only 'foo FUNC1 ;'.

I tried the Lark parser with the following grammar:

from lark import Lark

foo = Lark('''
  start: statement*
  statement: foo
           | anything
  anything: /.+/
  foo: "foo" ID ";"
  ID: /_?[a-z][_a-z0-9]*/i
  %import common.WS
  %import common.NEWLINE
  %ignore WS
  %ignore NEWLINE
''',
parser="lalr",
propagate_positions=True)

But the 'anything' pattern captures everything. Is there a way to make it non-greedy, so that the 'foo' rule can capture the given pattern?

Pierre G.

1 Answer


You could solve this with priorities.

For parser="lalr", Lark supports priorities on terminals. So you could move "foo" into its own terminal and give that terminal a higher priority than the anything terminal, which keeps the default priority (1 in older Lark releases, 0 in current ones):

  foo : FOO ID ";"
  FOO.2: "foo"

Parsing your example string then results in:

start
  statement
    anything    blablabla 
  statement
    foo
      foo
      FUNC1
  statement
    anything    blabliblo blu

For parser="earley", Lark supports priorities on rules, so you could use:

  foo.2 : "foo" ID ";"

Parsing your example string then results in:

start
  statement
    anything    blablabla 
  statement
    foo FUNC1
  statement
    anything     blabliblo blu