How to filter out rows with lots of conditions in pyspark?

Question

Let's say that these are my data:

` Product_Number| Condition|   Type     | Country
           1    |  New     | Chainsaw   | USA
           1    |  Old     | Chainsaw   | USA
           1    |  Null    | Chainsaw   | USA
           2    |  Old     | Tractor    | India
           3    |  Null    | Machete    | Colombia
           4    |  New     | Shovel     | Brazil
           5    |  New     | Fertilizer | Italy
           5    |  Old     | Fertilizer | Italy `

The problem is that sometimes, there are more than one Product_Number while it should be unique. What I am trying to do is from the ones that are in the dataframe more than once, to take the ones whose Condition is New without touching the rest. That gets to the result:

` Product_Number| Condition|   Type     | Country
           1    |  New     | Chainsaw   | USA
           2    |  Old     | Tractor    | India
           3    |  Null    | Machete    | Colombia
           4    |  New     | Shovel     | Brazil
           5    |  New     | Fertilizer | Italy`

What I tried to do is first to see how many distinct product numbers I have:

df.select('Product_Number').distinct().count()

Then identify the product numbers that exist most than once and put them in a list:

numbers = df.select('Product_Number').groupBy('Product_Number').count().where('count > 1')\
                   .select('Product_Number').rdd.flatMap(lambda x: x).collect()

Then I am trying to filter out the product numbers that exist more than once and the Condition isn't new. By filtering them out, if it is done perfectly, counting it should give the same number as df.select('Product_Number').distinct().count().

The code that I have tried is:

1)df.filter(~(df.Product_Number.isin(numbers)) & ~((df.Condition == 'Old') | (df.Condition.isNull())))

df.filter(~((df.Product_Number.isin(numbers)) & ((df.Condition == 'Old') | (df.Condition.isNull()))))
df.filter(~(df.Product_Number.isin(numbers)) & (df.Condition == 'New'))

However, I haven't succeeded until now.

score 0 · Answer 1 · answered Aug 10 '22 at 16:28

You conditions should be

(Product_Number is in numbers AND Condition == New) OR 
(Product_Number is not in numbers)

So, this is the correct filter condition.

df.filter((df.Product_Number.isin(numbers) & (df.Condition == 'New')) 
| (~df.Product_Number.isin(numbers)))

However, collect can be a heavy operation if you have large dataset and you can rewrite your code without collect.

from pyspark.sql import functions as F
w = Window.partitionBy('Product_Number')
df = (df.withColumn('cnt', F.count('*').over(w))
 .filter(((F.col('Condition') == 'New') & (F.col('cnt') > 1)) | (F.col('cnt') == 1))
)

How to filter out rows with lots of conditions in pyspark?

1 Answers1