Parsing formulas using Lark / EBNF

Question

I am working on parsing formulas written in an internal syntax, using Lark. It's the first time I'm doing this, so please bear with me.

The formulas look something like this:

MEAN(1,SUM({T(F_01.01)R(0100)C(0100)S(AT)[T-1Y]},{T(F_01.01)R(0100,0120)C(0100)S(AT)[T-1Y]})))

In a first step I would like to convert the above into something like this:

MEAN(1,SUM(F_01.01_r0100_c0100_sAT[T-1Y],F_01.01_r0100_c0100_sAT[T-1Y],F_01.01_r0120_c0100_sAT[T-1Y])))

Here is an example of my code:

from lark import Lark,Transformer

grammar = r"""

?start:
    | NUMBER
    | [symbols] datapoints ([symbols]+ datapoints)* [symbols]

?symbols.1:
    | /\+/
    | /\-/
    | /\//
    | /\*/
    | /\*\*/
    | /\,/
    | /\(/
    | /\)/
    | /\w+/

?datapoints.2:
           | "{" "T" "(" TABLE ")" [ "R" "(" ROW ")"] ["C" "(" COLUMN ")"] ["S" "(" SHEETS  ")"] [TIME_SHIFT] "}"   -> its_data_point
           | "{" "SPE.DPI" "(" CNAME ")" [TIME_SHIFT] "}"    -> ste_data_point

TIME_UNIT: "M" | "Q" | "Y"
TIME_SHIFT: /\[T\-/ INT TIME_UNIT /\]/ | /\[PYE\]/

TABLE: /[A-Z]{1}/ "_" (/\d{3}/ | /\d{2}/) "." /\d{2}/ ["." /[a-z]/]
ROW:  /\d{4}/ (/\,\d{4}/)*
COLUMN: /\d{4}/ (/\,\d{4}/)*
SHEETS: /[a-zA-T0-9_]+/ ("," /[a-zA-T0-9_]+/)*

OTHER: /[a-zA-Z]+/

%import common.WS_INLINE
%import common.INT
%import common.CNAME
%import common.NUMBER

%ignore WS_INLINE

"""

sp = Lark(grammar)

class MyTransFormer(Transformer):

    def __init__(self):
        super().__init__()
        self.its_data_points = []

    def its_data_point(self,items):
        t,r,c,s,ts=items
        res = []
        for row in r.split(','):
            res.append(str(t)+'_r'+ str(row)+'_c'+str(c)+'_s'+str(s)+str(ts))
        self.its_data_points += res
        return ','.join(res)

    def __default_token__(self, token):
        return str(token.value)

    def __default__(self, data, children, meta):
        return ''.join(children)

teststr="MEAN(1,SUM({T(F_01.01)R(0100,0120)C(0100)S(AT)[T-1Y]},{T(F_01.01)R(0100)C(0100)S(AT)[T-1Y]}))"
tree = sp.parse(teststr)
mt = MyTransFormer()
print(mt.transform(tree))

But with this I get:

MEANMEAN(1,SUM(F_01.01_r0100_c0100_sAT[T-1Y],F_01.01_r0120_c0100_sAT[T-1Y],F_01.01_r0100_c0100_sAT[T-1Y]))

Why do I get 'MEAN' twice?

This grammar does not parse that example string, so it's not really possible to help you. — MegaIng, May 30 '23 at 20:24
Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. — Community, May 31 '23 at 06:54
I have amended the question to include a working example. I am puzzled why I get the first part of the formula twice. — VicVic, Jun 01 '23 at 10:10

score -1 · Answer 1 · answered Jun 01 '23 at 11:08

The problem is that your grammar is so ambiguous that Lark's default ambiguity resolver gets confused and duplicates terminals. That shouldn't happen from the library's point of view, and I think there is already an issue open for something like that.

However, there is a really simple fix: rewrite the grammar to be far less ambiguous:

?start: NUMBER
      | (symbols|datapoints)*

?!symbols: "+" | "-" | "*" | "**" | "," | "(" | ")" | /\w+/

?datapoints: "{" "T" "(" TABLE ")" [ "R" "(" ROW ")"] ["C" "(" COLUMN ")"] ["S" "(" SHEETS  ")"] [TIME_SHIFT] "}"   -> its_data_point
           | "{" "SPE.DPI" "(" CNAME ")" [TIME_SHIFT] "}"    -> ste_data_point

I removed the empty alternatives from symbols and datapoints. When a rule otherwise has a fixed size, optionality is better expressed where the rule is used, with an optional marker, i.e. ? or []. In addition, the combination of symbols and datapoints in the second line of your start rule boils down to any combination of symbols and datapoints in any order. I'm not sure that is what you wanted, but simplified like this the input is parsed correctly.

You can see that the ambiguity is the problem by passing ambiguity="explicit" to the Lark constructor. The parse then doesn't complete, because it can't generate the millions of possible derivations the original grammar allows.

I would suggest always aiming to write a grammar in such a way that parser='lalr' works. With the original grammar, that raises complaints about various ambiguities, which you would then fix. That isn't always possible, but here it probably is.

Many thanks for your very useful reply. I am completely new to syntax parsing and Lark; could you elaborate a little on what the exclamation mark in front of symbols does? I didn't find much documentation on it. My goal is to take formulas written in a specific syntax and, as a first step, extract all "datapoints" from them. These are enclosed in curly brackets. In other words: convert the stuff in curly brackets and leave the rest alone. — VicVic, Jun 01 '23 at 12:03
