Brand new to PySpark, and I'm refactoring some R code that is starting to lose its ability to scale properly. I return a dataframe that has a number of columns with numeric values, and I'm trying to filter this result set into a new, smaller result set using multiple compound conditions.
from pyspark.sql import functions as f

# Each comparison is wrapped in its own parentheses because & and | bind
# more tightly than ==, >= etc. on Column expressions.
matches = df.filter(
    ((f.col('business') >= 0.9) & (f.col('city') == 1.0) & (f.col('street') >= 0.7)) |
    ((f.col('phone') == 1) & (f.col('firstname') == 1) & (f.col('street') == 1) & (f.col('city') == 1)) |
    ((f.col('business') >= 0.9) & (f.col('street') >= 0.9) & (f.col('city') == 1)) |
    ((f.col('phone') == 1) & (f.col('street') == 1) & (f.col('city') == 1)) |
    ((f.col('lastname') >= 0.9) & (f.col('phone') == 1) & (f.col('business') >= 0.9) & (f.col('city') == 1)) |
    ((f.col('phone') == 1) & (f.col('street') == 1) & (f.col('city') == 1) & (f.col('busname') >= 0.6))
)
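From the documentation, my understanding is that f.when() is for building a new column (paired with .otherwise()) rather than for filtering rows, which is why the filter above uses f.col() instead — please correct me if that's wrong. If I wanted to flag matches rather than drop non-matches, I believe it would look something like this (untested sketch, first rule only):

# Untested sketch: the first rule as a 0/1 indicator column via when/otherwise
flagged = df.withColumn(
    'is_match',
    f.when(
        (f.col('business') >= 0.9) & (f.col('city') == 1.0) & (f.col('street') >= 0.7), 1
    ).otherwise(0)
)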
Essentially I'm just trying to return a new dataframe, matches, containing the rows of the previous dataframe, df, whose columns satisfy the criteria pasted above. I've read a couple of other filtering posts such as
multiple conditions for filter in spark data frames
PySpark: multiple conditions in when clause
however I still can't seem to get the syntax right. I suppose I could filter on one condition at a time and then call a unionAll, but I felt this would be the cleaner way.
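In case it clarifies what I'm after, this is roughly the shape I have in mind: each rule as its own Column expression, OR'd together with a reduce so it stays a single filter instead of six filters plus a unionAll. Untested sketch, column names as above:

from functools import reduce
from operator import or_
from pyspark.sql import functions as f

# Untested sketch: one Column expression per rule, combined with a logical OR
rules = [
    (f.col('business') >= 0.9) & (f.col('city') == 1.0) & (f.col('street') >= 0.7),
    (f.col('phone') == 1) & (f.col('firstname') == 1) & (f.col('street') == 1) & (f.col('city') == 1),
    (f.col('business') >= 0.9) & (f.col('street') >= 0.9) & (f.col('city') == 1),
    (f.col('phone') == 1) & (f.col('street') == 1) & (f.col('city') == 1),
    (f.col('lastname') >= 0.9) & (f.col('phone') == 1) & (f.col('business') >= 0.9) & (f.col('city') == 1),
    (f.col('phone') == 1) & (f.col('street') == 1) & (f.col('city') == 1) & (f.col('busname') >= 0.6),
]
matches = df.filter(reduce(or_, rules))

If there's a more idiomatic way to express this kind of multi-rule filter in PySpark, that's really what I'm asking.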