Pyspark compound filter, multiple conditions

Question

Brand new to PySpark and I'm refactoring some R code that is starting to lose its ability to scale properly. I return a dataframe that has a number of columns with numeric values, and I'm trying to filter this result set into a new, smaller result set using multiple compound conditions.

from pyspark.sql import functions as f

matches = df.filter(f.when('df.business') >=0.9 & (f.when('df.city') == 1.0) & (f.when('street') >= 0.7)) |
                   (f.when('df.phone') == 1) & (f.when('df.firstname') == 1) & (f.when('df.street') == 1) & (f.when('df.city' == 1)) |
                   (f.when('df.business') >=0.9) & (f.when('df.street') >=0.9) & (f.when('df.city')) == 1))) |
                   (f.when('df.phone') == 1) & (f.when('df.street') == 1) & (f.when('df.city')) == 1))) |
                   (f.when('df.lastname') >=0.9) & (f.when('df.phone') == 1) & (f.when('df.business')) >=0.9 & (f.when('df.city') == 1))) |
                   (f.when('df.phone') == 1 & (f.when('df.street') == 1 & (f.when('df.city') == 1) & (f.when('df.busname') >= 0.6)))

Essentially I'm just trying to return a new dataframe, "matches", containing the rows of the previous dataframe, "df", whose columns meet the criteria pasted above. I've read a couple of other filtering posts, such as

multiple conditions for filter in spark data frames

PySpark: multiple conditions in when clause

However, I still can't seem to get it right. I suppose I could filter on one condition at a time and then call `unionAll`, but I felt that a single filter would be the cleaner way.

I think your parentheses are not balanced. I take it that all the statements on one line are joined by `and`, and that the 6 lines are joined by `or`. Is that correct? — cph_sto, Jan 29 '19 at 15:23
@cph_sto That is correct. Each line contains multiple `and` conditions, and the lines are joined by `or`. — DataDog, Jan 29 '19 at 15:35
I have posted an answer. Please check it. There may be typos in the digits or `>=, <=, ==` signs, so check them. — cph_sto, Jan 29 '19 at 15:38

cph_sto · Accepted Answer · 2019-01-29T15:48:39.317

Since @DataDog has clarified the intent, the code below replicates the filters the OP described.

Note: Every clause and sub-clause must be wrapped in its own parentheses. I did not have the data to test this, so if I have missed a parenthesis somewhere it is an inadvertent mistake, but the idea remains the same.
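
The parentheses matter because of Python's operator precedence: `&` and `|` bind more tightly than comparisons such as `>=` and `==`. A quick way to see this, as a rough sketch using only the standard `ast` module (the names `business` and `city` are just placeholders here, nothing is evaluated):

```python
import ast

# Without parentheses, `&` is applied before `>=` and `==`, so Python reads
# this as the chained comparison  business >= (0.9 & city) == 1.
unparenthesized = ast.parse("business >= 0.9 & city == 1", mode="eval").body

# With parentheses, each comparison is grouped first and `&` combines them.
parenthesized = ast.parse("(business >= 0.9) & (city == 1)", mode="eval").body

print(type(unparenthesized).__name__)  # Compare
print(type(parenthesized).__name__)    # BinOp
```

With column objects, the unparenthesized form therefore does not build the intended condition, which is why each comparison gets its own pair of parentheses.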

matches = df.filter(
    ((df.business >= 0.9) & (df.city == 1) & (df.street >= 0.7))
    |
    ((df.phone == 1) & (df.firstname == 1) & (df.street == 1) & (df.city == 1))
    |
    ((df.business >= 0.9) & (df.street >= 0.9) & (df.city == 1))
    |
    ((df.phone == 1) & (df.street == 1) & (df.city == 1))
    |
    ((df.lastname >= 0.9) & (df.phone == 1) & (df.business >= 0.9) & (df.city == 1))
    |
    ((df.phone == 1) & (df.street == 1) & (df.city == 1) & (df.busname >= 0.6))
)

Shoot. I had something similar but changed it when I read a few other posts. This 100% worked, thanks very much! — DataDog, Jan 29 '19 at 15:50
