Match substrings in lark

Question

How can properly match substrings using Lark?

My intention (maybe it's not possible/advisable with Lark, or any CFG) is to match and parse only important parts of the string, ignoring the rest. For example:

From "John", I'd like to parse "John" as one_person.
From "Yesterday John slept", I'd like to parse "John" as one_person and ignore the rest.
From "John and Mary", I'd like to parse "John and Mary" as two_people.
From "Yesterday John and Mary kissed.", I'd like to parse "John and Mary" as two_people and ignore the rest.

Here's my code:

from lark import Lark, tree

grammar = """
    input: not_important* important not_important*

    important: one_person
        | two_people

    one_person: PERSON
    two_people: one_person conj one_person
    not_important: RANDOM_WORD

    conj: CONJ

    PERSON: "John" | "Mary"
    CONJ: "and" | "or"
    RANDOM_WORD: /\\w+/

    %import common.WS
    %ignore WS
"""

if __name__ == '__main__':
    parser = Lark(grammar, start='input', ambiguity='explicit')
    tree = parser.parse('Yesterday John and Mary kissed')
    print(tree.pretty())

It works:

When nothing surrounds the important part, e.g. "John" or "John and Mary".
Or when only one side of the important part has unimportant stuff, e.g. "John slept" or "John and Mary kissed".

But it doesn't work when unimportant stuff surrounds the important stuff, e.g. "Yesterday John and Mary kissed". In this example, I was hoping to get:

input
    not_important   Yesterday
    important
      two_people
        one_person  John
        conj    and
        one_person  Mary
    not_important   kissed

But I get:

_ambig
  input
    not_important   Yesterday
    important
      one_person    John
    not_important   and
    not_important   Mary
    not_important   kissed
  input
    not_important   Yesterday
    important
      two_people
        one_person  John
        conj    and
        one_person  Mary
    not_important   kissed
    not_important   John
    not_important   and

That is, not only does Lark see the input as ambiguous, but it also fails the second parse, as two terminals ("John" and "and") are consumed twice.

Why not just search using a regexp? Or if you need something more sophisticated, perhaps Spacy can help you? I think while Lark can "solve this, it is not the best tool for this job, just from what you wrote here. — Erez, Mar 07 '20 at 22:01
Good ideas, but I've tried both and failed ;-) spaCy due to setup costs: it'd need to be running as a separate service to avoid loading it for every request in my application. regexes bc of maintenance: the patterns i need to capture are many and they can be nested, which would be a pest to maintain (and write) as regexes. — Rodrigo, Mar 09 '20 at 09:48

Ushyme · Accepted Answer · 2020-06-02T11:03:39.470

One way to solve this would be to set a higher priority for some of your rules:

from lark import Lark, tree

grammar = """
    input: not_important* important not_important* 

    important.2: one_person
        | two_people

    one_person: PERSON
    two_people.2: one_person conj one_person
    not_important: RANDOM_WORD

    conj.2: CONJ

    PERSON: "John" | "Mary"
    CONJ: "and" | "or"
    RANDOM_WORD: /\\w+/

    %import common.WS
    %ignore WS
"""

if __name__ == '__main__':
    parser = Lark(grammar, start='input')
    tree = parser.parse('Yesterday John and Mary kissed')
    print(tree.pretty())

Notice the .2 added to important, two_people and conj, this is how you set priority in lark. By default, all rules have priority 1. With the priorities set, you can remove ambiguity='explicit', as lark will be able to deal with the ambiguities correctly.

The second approach is to use lalr as the parser. This allows you set priorities on terminals. Then we can set PERSON and CONJ's priority to 2. This will make the grammar unambiguous, which is a requirement for using lalr


from lark import Lark, tree

grammar = """
    input: not_important* important not_important* 

    important: one_person
        | two_people

    one_person: PERSON
    two_people: one_person conj one_person
    not_important: RANDOM_WORD

    conj: CONJ

    PERSON.2: "John" | "Mary"
    CONJ.2: "and" | "or"
    RANDOM_WORD: /\\w+/

    %import common.WS
    %ignore WS
"""

if __name__ == '__main__':
    parser = Lark(grammar, start='input', parser="lalr")
    tree = parser.parse('Yesterday John and Mary kissed')
    print(tree.pretty())

Both approaches output

input
  not_important Yesterday
  important
    two_people
      one_person        John
      conj      and
      one_person        Mary
  not_important kissed

Match substrings in lark

1 Answers1