
What I would like to do is parse an expression such as this one:

result = A + B + sqrt(B + 4)

Where A and B are columns of a dataframe. To get the result, I would have to translate the expression into something like this:

new_col = df.B + 4
result = df.A + df.B + new_col.apply(sqrt)

Where df is the dataframe.

I have tried re.sub, but it is only good for replacing the column variables (not the functions), like this:

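import re
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})  # example data

def repl(match):
    # wrap each identifier in a column lookup, e.g. A -> df['A']
    inner_word = match.group(1)
    return "df['{}']".format(inner_word)

eq = 'A + 3 / B'
new_eq = re.sub(r'([a-zA-Z_]+)', repl, eq)  # "df['A'] + 3 / df['B']"
result = eval(new_eq)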

So, my questions are:

  • Is there a Python library to do this? If not, how can I achieve this in a simple way?
  • Could creating a recursive function be the solution?
  • Would using "reverse Polish notation" simplify the parsing?
  • Would I have to use the ast module?
ChesuCR

2 Answers


pandas DataFrames have an eval method. Using your example equation:

import pandas as pd
# create an example DataFrame to work with
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
# define equation
eq = 'A + 3 / B'
# actual computation
df.eval(eq)

# more complicated equation
eq = "A + B + sqrt(B + 4)"
df.eval(eq)
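
If the result should end up as a new column, the expression can also contain an assignment. A minimal sketch, assuming a small example frame (the data and the column name `result` are made up for illustration):

```python
import pandas as pd

# illustrative data, chosen so the square roots come out as whole numbers
df = pd.DataFrame({"A": [1, 2], "B": [5, 12]})

# with the default inplace=False, eval returns a new DataFrame
# that carries the extra column; df itself is left unchanged
out = df.eval("result = A + B + sqrt(B + 4)")
print(out)
```

Passing `inplace=True` should add the column to `df` directly instead of returning a copy.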

Warning

Keep in mind that eval can run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.
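
One way to reduce that risk is to check the expression before evaluating it. The sketch below is an illustration under assumptions, not a complete sandbox: it uses the standard `ast` module to accept only arithmetic on known column names plus a small whitelist of function names. The helper `is_safe_expression` and the whitelist `ALLOWED_FUNCS` are hypothetical names, not part of pandas.

```python
import ast

# hypothetical whitelist; extend it to match the functions you want to allow
ALLOWED_FUNCS = {"sqrt", "sin", "cos", "exp", "log", "abs"}

# node types that plain arithmetic expressions are made of
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Call, ast.Name, ast.Load,
    ast.Constant, ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Pow,
    ast.USub, ast.UAdd,
)

def is_safe_expression(expr, columns):
    """Return True if expr only uses numbers, known columns and whitelisted calls."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            return False  # attribute access, subscripts, lambdas, ...
        if isinstance(node, ast.Call):
            # only direct calls such as sqrt(...), and no keyword arguments
            if not isinstance(node.func, ast.Name) or node.func.id not in ALLOWED_FUNCS:
                return False
            if node.keywords:
                return False
        elif isinstance(node, ast.Name):
            if node.id not in columns and node.id not in ALLOWED_FUNCS:
                return False
        elif isinstance(node, ast.Constant):
            if not isinstance(node.value, (int, float)):
                return False  # rejects strings, bytes, None, ...
    return True

print(is_safe_expression("A + B + sqrt(B + 4)", {"A", "B"}))
print(is_safe_expression("__import__('os').system('ls')", {"A", "B"}))
```

An expression would then only be handed to `df.eval` when the check passes, for example `if is_safe_expression(eq, set(df.columns)): df.eval(eq)`.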

ChesuCR
uuazed
  • Many thanks! It works fine. I would like to use other functions, but I have read this: "The supported math functions are sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2". So I am afraid I can use only those functions. Is it possible to add external functions to the expression? With the builtin Python [`eval()`](https://docs.python.org/3/library/functions.html#eval) function it is possible to use the `locals` dictionary to add the functions as objects, but I could not make it work with `df.eval()` – ChesuCR Nov 06 '17 at 14:11
  • Well, I have written [another question](https://stackoverflow.com/questions/47161939/how-can-i-use-a-custom-function-within-an-expression-using-the-eval-dataframe-me) to manage this – ChesuCR Nov 07 '17 at 15:44
  • Please add a caution to this. eval() allows arbitrary code to be run. This can be dangerous if eval is called on a string that is not sanitized! [eval()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.eval.html) **This allows _eval_ to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.** – Tom Myddeltyn Dec 01 '20 at 13:29
  • You are right @Tom. I will add the warning to the answer. Thanks – ChesuCR Dec 01 '20 at 19:25
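
The approach mentioned in the comments, passing a dictionary to the builtin `eval()` so that extra functions are available by name, could look roughly like this. It is a sketch only: `double_plus_one` is a hypothetical custom function, the data is made up, and emptying `__builtins__` does not make `eval()` safe for untrusted input.

```python
import numpy as np
import pandas as pd

def double_plus_one(s):
    # hypothetical custom function, standing in for any "external" function
    return 2 * s + 1

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# names the expression may use: every column, plus the functions we expose
namespace = {name: df[name] for name in df.columns}
namespace.update({"sqrt": np.sqrt, "double_plus_one": double_plus_one})

eq = "A + double_plus_one(B)"
# globals carry no builtins; the namespace dict is passed as the locals
result = eval(eq, {"__builtins__": {}}, namespace)
print(result)
```

Because the columns are looked up by name in the dictionary, no regex rewriting of the expression is needed, but the column names must be valid Python identifiers.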

Following the example provided by @uuazed, a faster way would be to use numexpr:

import pandas as pd
import numpy as np
import numexpr as ne

df = pd.DataFrame(np.random.randn(int(1e6), 2), columns=['A', 'B'])
eq = "A + B + sqrt(B + 4)"
%timeit df.eval(eq)
# 15.9 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit A=df.A; B=df.B; ne.evaluate(eq)
# 6.24 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

numexpr may also support more operations.

avelo
  • It is faster, but you need to know the variables you are going to use in advance. If not, you must do some parsing before the evaluation, and that will take time – ChesuCR Nov 06 '17 at 12:52
  • If you check the [eval documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.eval.html#pandas.eval), the default `engine` is `numexpr` – ChesuCR Nov 06 '17 at 13:16
  • Yes, good point! It is curious that the timing differs so much just for the evaluation – avelo Nov 06 '17 at 13:21
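
Regarding the concern about having to know the variables in advance: `ne.evaluate` accepts a `local_dict` argument, so the columns can be supplied by name without parsing the expression first. A minimal sketch, assuming numexpr is installed and that all column names are valid identifiers (`evaluate_on_frame` is a hypothetical helper name):

```python
import numexpr as ne
import numpy as np
import pandas as pd

def evaluate_on_frame(expr, frame):
    # hypothetical helper: expose every column as a NumPy array under its name;
    # numexpr only looks up the names that actually occur in the expression
    columns = {name: frame[name].to_numpy() for name in frame.columns}
    return ne.evaluate(expr, local_dict=columns)

df = pd.DataFrame({"A": [1.0, 2.0], "B": [5.0, 12.0]})
result = evaluate_on_frame("A + B + sqrt(B + 4)", df)
print(result)
```

The return value is a NumPy array, so it would need to be wrapped in a Series (for example `pd.Series(result, index=df.index)`) to line up with the original index.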