How can properly match substrings using Lark?
My intention (maybe it's not possible/advisable with Lark, or any CFG) is to match and parse only important parts of the string, ignoring the rest. For example:
- From "John", I'd like to parse "John" as
one_person
. - From "Yesterday John slept", I'd like to parse "John" as
one_person
and ignore the rest. - From "John and Mary", I'd like to parse "John and Mary" as
two_people
. - From "Yesterday John and Mary kissed.", I'd like to parse "John and Mary" as
two_people
and ignore the rest.
Here's my code:
from lark import Lark, tree
grammar = """
input: not_important* important not_important*
important: one_person
| two_people
one_person: PERSON
two_people: one_person conj one_person
not_important: RANDOM_WORD
conj: CONJ
PERSON: "John" | "Mary"
CONJ: "and" | "or"
RANDOM_WORD: /\\w+/
%import common.WS
%ignore WS
"""
if __name__ == '__main__':
parser = Lark(grammar, start='input', ambiguity='explicit')
tree = parser.parse('Yesterday John and Mary kissed')
print(tree.pretty())
It works:
- When nothing surrounds the important part, e.g. "John" or "John and Mary".
- Or when only one side of the important part has unimportant stuff, e.g. "John slept" or "John and Mary kissed".
But it doesn't work when unimportant stuff surrounds the important stuff, e.g. "Yesterday John and Mary kissed". In this example, I was hoping to get:
input
not_important Yesterday
important
two_people
one_person John
conj and
one_person Mary
not_important kissed
But I get:
_ambig
input
not_important Yesterday
important
one_person John
not_important and
not_important Mary
not_important kissed
input
not_important Yesterday
important
two_people
one_person John
conj and
one_person Mary
not_important kissed
not_important John
not_important and
That is, not only does Lark see the input as ambiguous, but it also fails the second parse, as two terminals ("John" and "and") are consumed twice.