I have a Spark DataFrame that looks like this, where 'expr' is a SQL/Hive filter expression:
+----------------------+----+----+
|expr                  |var1|var2|
+----------------------+----+----+
|var1 > 7              |9   |0   |
|var1 > 7              |9   |0   |
|var1 > 7              |9   |0   |
|var1 > 7              |9   |0   |
|var1 = 3 AND var2 >= 0|9   |0   |
|var1 = 3 AND var2 >= 0|9   |0   |
|var1 = 3 AND var2 >= 0|9   |0   |
|var1 = 3 AND var2 >= 0|9   |0   |
|var1 = 2 AND var2 >= 0|9   |0   |
+----------------------+----+----+
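In case it helps to reproduce, the DataFrame above can be built roughly like this (the real data comes from elsewhere; this is just sample data matching the table):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data matching the table shown above
df = spark.createDataFrame(
    [('var1 > 7', 9, 0)] * 4
    + [('var1 = 3 AND var2 >= 0', 9, 0)] * 4
    + [('var1 = 2 AND var2 >= 0', 9, 0)],
    ['expr', 'var1', 'var2'],
)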
I want to transform this DataFrame into the one below, where 'flag' is the Boolean value obtained by evaluating the expression in the 'expr' column:
+----------------------+----+----+----+
|expr                  |var1|var2|flag|
+----------------------+----+----+----+
|var1 > 7              |9   |0   |True|
|var1 > 7              |9   |0   |True|
|var1 > 7              |9   |0   |True|
|var1 > 7              |9   |0   |True|
|var1 = 3 AND var2 >= 0|9   |0   |.   |
|var1 = 3 AND var2 >= 0|9   |0   |.   |
|var1 = 3 AND var2 >= 0|9   |0   |.   |
|var1 = 3 AND var2 >= 0|9   |0   |.   |
|var1 = 2 AND var2 >= 0|9   |0   |.   |
+----------------------+----+----+----+
I have tried using the expr function like this:

from pyspark.sql.functions import col, expr

df.withColumn('flag', expr(col('expr')))

It fails, as expected, because the expr function expects a string as its parameter, not a Column.
Another idea I considered was writing a UDF and passing the 'expr' column's value into it, but that would not let me use PySpark's expr function inside the UDF, because UDFs run as plain Python code outside Spark's SQL engine. A sketch of what I mean is below.
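This is roughly the UDF idea I have in mind (the function name is just a placeholder); its body would have to re-implement SQL expression evaluation in plain Python, which is what I am trying to avoid:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

def evaluate_expr(expr_str, var1, var2):
    # There is no way to call Spark's expr() on expr_str here;
    # I would have to parse and evaluate the SQL expression myself.
    ...

evaluate_expr_udf = udf(evaluate_expr, BooleanType())
df = df.withColumn('flag', evaluate_expr_udf(col('expr'), col('var1'), col('var2')))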
What should my approach be? Any suggestions, please?