I'm writing a string parser to do something akin to Patsy

I've got the operators working (:, +, -, /, etc.), but I can't get functions working. I'm only pasting the code directly related to the problem.

from ply import yacc, lex


em_data = {'a': ['a1', 'a1', 'a2', 'a2', 'a1', 'a1', 'a2', 'a2'],
           'b': ['b1', 'b2', 'b1', 'b2', 'b1', 'b2', 'b1', 'b2'],
           'x1': [1.76405235, 0.40015721, 0.97873798, 2.2408932, 1.86755799,
                                   -0.97727788, 0.95008842, -0.15135721],
           'x2': [-0.10321885, 0.4105985, 0.14404357, 1.45427351, 0.76103773,
                                   0.12167502, 0.44386323, 0.33367433],
           'y': [1.49407907, -0.20515826, 0.3130677, -0.85409574, -2.55298982,
                                  0.6536186, 0.8644362, -0.74216502],
           'z': [2.26975462, -1.45436567, 0.04575852, -0.18718385, 1.53277921,
                                  1.46935877, 0.15494743, 0.37816252]}

########################################
# define all the tokens we will need
########################################
tokens = (
    # Atomics
    "NAME",  # Feature names
    "NUMBER",  # Numeric literals

    # Binary Ops
    "RELATIONSHIP",  # y ~ x : end result is a tuple of (y, x)

    # Symbols
    "LPAREN",
    "RPAREN",

    # Functions
    "C"  # Expands vector elements into 1-hot
)

########################################
# Building the regexps
########################################

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

t_ignore = ' '

# Atomics
t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'

def t_NUMBER(t):
    r'\d+'
    t.value = str(t.value)
    return t

t_LPAREN = r'\('
t_RPAREN = r'\)'
t_C = r'C'

########################################
# Define the parser
########################################

lex.lex(debug=True)

My parser is then

precedence = (
    ("left", "RELATIONSHIP"),
    ("left", "C")
)


def p_expression_group(p):
    "expression : LPAREN expression RPAREN"
    p[0] = p[2]

def p_expression_number(p):
    "expression : NUMBER"
    p[0] = p[1]

def p_expression_name(p):
    "expression : NAME"
    if p[1] not in em_data:
        raise RuntimeError(f"Term: {p[1]} not found in dataset lookup")

    p[0] = p[1]

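def p_error(p):
    # PLY passes None when the error is at the end of the input
    if p is None:
        print("Syntax error at end of input")
    else:
        print(f"Syntax error at {p.value!r}")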

def p_C(p):
    "statement : C LPAREN expression RPAREN"
    print("got in here!")


if __name__ == '__main__':
    s = "C(a)"
    yacc.yacc()
    yacc.parse(s)

Some questions:

  1. Does it matter whether I put statement or expression in the grammar rule of a p_X function? From what I've read, an expression is something that reduces to a single value, which, IMO, implies that a statement cannot be reduced to a single value. E.g. x = 5 cannot be reduced? In this case, would a list, e.g. ["x", "x:z", ...], be a statement or an expression? My intuition says it's an expression, but I want to be sure.

  2. When I run all the code above, I get Syntax error at '('. I'm not sure WHY this is happening. I don't see any reason why it SHOULD.

IanQ
  • Re #2: *Which* `(`? – Scott Hunter Mar 13 '21 at 18:26
  • I'm really not sure... I ASSUME it's the one in the string, `s`; when I use s without the call to `C` there is no error. I've got debug mode turned on but I'm having trouble understanding the `parser.out` file. Might it help to paste it here? – IanQ Mar 13 '21 at 18:34

1 Answer

You've defined two non-terminals in your grammar: expression and statement. When you call parse, PLY parses from the start rule. If you don't set a start rule explicitly using start = 'rule_name', it defaults to the non-terminal of the first rule you define in your grammar, i.e. expression.

So your parser is rejecting your input because it's not a valid expression. To make it parse statements instead, you can either move p_C so that it comes first, or you can set start = 'statement'.
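
As a rough illustration of the second option, here is a minimal sketch with `start` set explicitly. To keep the lexer out of the picture, it matches the function name as a plain `NAME` rather than a dedicated `C` token. That choice, the `p_statement_call` rule name, the `("call", ...)` tuple, and raising `SyntaxError` from the error handlers are assumptions of this sketch, not part of the question's code.

```python
from ply import lex, yacc

tokens = ("NAME", "LPAREN", "RPAREN")

t_ignore = " "
t_NAME = r"[a-zA-Z_][a-zA-Z0-9_]*"
t_LPAREN = r"\("
t_RPAREN = r"\)"

def t_error(t):
    raise SyntaxError(f"Illegal character {t.value[0]!r}")

# Without this line PLY would start from the first rule below, `expression`,
# and reject any input that is not a bare expression.
start = "statement"

def p_expression_name(p):
    "expression : NAME"
    p[0] = p[1]

def p_expression_group(p):
    "expression : LPAREN expression RPAREN"
    p[0] = p[2]

def p_statement_call(p):
    "statement : NAME LPAREN expression RPAREN"
    # Hypothetical result shape: (tag, function name, argument)
    p[0] = ("call", p[1], p[3])

def p_error(p):
    where = "end of input" if p is None else repr(p.value)
    raise SyntaxError(f"Syntax error at {where}")

lexer = lex.lex()
parser = yacc.yacc(debug=False)  # debug=False skips writing parser.out

result = parser.parse("C(a)", lexer=lexer)
print(result)
```

With `start` removed, the same grammar would begin at `expression`, so `C(a)` would be rejected at the `(`, which matches the error in the question.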

sepp2k
  • 1) Hmm, when I put `p_C` first (I assume you mean before all the other `p_X`s) I now get `Syntax error at 'C'`. 2) Can you expand on how I would "undefine" those non-terminals? Also, how does `start = ` work if I don't know exactly what the input will start with? It could be a statement or a variable (also, I ran into `Syntax error at 'C'` when trying this). – IanQ Mar 13 '21 at 18:44
  • @IanQuah Ah, I hadn't looked at your lexer before. The problem now is that your lexer recognizes `C` as a `NAME`, not as a `C`. If you really want to allow arbitrary function names, but you just started with `C` to keep it simple, it'd be easiest to just remove the `C` token and change the rule to `NAME LPAREN expression RPAREN`. But if you really do want to only allow `C` as a function name, you should [read this on how to handle reserved words vs. identifiers in PLY](https://www.dabeaz.com/ply/ply.html#ply_nn6). – sepp2k Mar 13 '21 at 19:00
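
To see why `C` comes out as a `NAME`, the following stdlib-only sketch imitates how PLY orders string-defined token rules: longest regex first, so the `NAME` pattern is tried before `C`. It does not use PLY itself. The `tokenize` helper, the `use_reserved` flag, and the `reserved` table are hypothetical names for illustration only.

```python
import re

# Token patterns from the question. PLY adds string-defined rules to its
# master regex in order of decreasing pattern length, so NAME precedes C.
rules = {
    "C": r"C",
    "NAME": r"[a-zA-Z_][a-zA-Z0-9_]*",
    "LPAREN": r"\(",
    "RPAREN": r"\)",
}
ordered = sorted(rules.items(), key=lambda kv: len(kv[1]), reverse=True)
master = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in ordered))

# Reserved words mapped to their token types (hypothetical table).
reserved = {"C": "C"}

def tokenize(text, use_reserved):
    """Return (token type, text) pairs, skipping spaces."""
    out = []
    pos = 0
    while pos < len(text):
        if text[pos] == " ":
            pos += 1
            continue
        m = master.match(text, pos)
        if m is None:
            raise SyntaxError(f"Illegal character {text[pos]!r}")
        kind = m.lastgroup
        # The fix: match identifiers first, then reclassify reserved words.
        if use_reserved and kind == "NAME":
            kind = reserved.get(m.group(), "NAME")
        out.append((kind, m.group()))
        pos = m.end()
    return out

print(tokenize("C(a)", use_reserved=False))
print(tokenize("C(a)", use_reserved=True))
```

In PLY, the equivalent lookup would go inside a `t_NAME` function as `t.type = reserved.get(t.value, 'NAME')`, with the separate `t_C` string rule removed, which is the pattern the linked documentation describes.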