I'm starting with Lark and got stuck on an issue with parsing special characters.

I have expressions defined by a grammar. For example, these are valid expressions: Car{_}, Apple3{3+}, Dog{a_7}, r2d2{A3*}, A{+}... More formally, they have the form name{feature}, where

  • name: CNAME
  • feature: (DIGIT|LETTER|"+"|"-"|"*"|"_")+

The definition of constants can be found here.

The problem is that the special characters are not present in the produced tree (see the example below). I have seen this answer, but it did not help me. I tried placing ! before the special characters and escaping them. I also tried enabling keep_all_tokens, but that is not what I want, because then the characters { and } also appear in the tree. Any ideas how to solve this problem? Thank you.

from lark import Lark

grammar = r"""
    start: object

    object : name "{" feature "}" | name

    feature: (DIGIT|LETTER|"+"|"-"|"*"|"_")+
    name: CNAME

    %import common.LETTER
    %import common.DIGIT
    %import common.CNAME
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar, parser='lalr',
              lexer='standard',
              propagate_positions=False,
              maybe_placeholders=False)

def test():
    test_str = '''
        Apple_3{3+}
    '''

    j = parser.parse(test_str)
    print(j.pretty())

if __name__ == '__main__':
    test()

The output looks like this:

start
  object
    name    Apple_3
    feature 3

instead of

start
  object
    name    Apple_3
    feature 
      3
      +
Matho

1 Answer

You said you tried placing ! before the special characters. As I understand the question you linked, the ! has to be placed before the rule name instead:

!feature: (DIGIT|LETTER|"+"|"-"|"*"|"_")+

This produces your expected result for me:

start
  object
    name    Apple_3
    feature
      3
      +
  • Thank you for your response. The solution you proposed works only in some cases. Once a `LETTER` is present in `feature` expression, an exception is thrown: e.g. for `test_str = "Apple_3{3+a}"` the following exception `lark.exceptions.UnexpectedToken: Unexpected token Token(CNAME, 'a')` – Matho Feb 07 '20 at 09:34
  • @Matho, his answer is correct. The exception you're getting is because of a collision between `LETTER` and `CNAME`. This is because you're using `lexer="standard"`, which the documentation explicitly states is there only for legacy. So either remove it and use the default lexer, or fix the priority manually. – Erez Feb 07 '20 at 09:58
  • @Erez yes you are right, that is something I didn't notice. Sorry and thank you both. – Matho Feb 07 '20 at 11:08