Let's say that these are my data:
` Product_Number| Condition| Type | Country
1 | New | Chainsaw | USA
1 | Old | Chainsaw | USA
1 | Null | Chainsaw | USA
2 | Old | Tractor | India
3 | Null | Machete | Colombia
4 | New | Shovel | Brazil
5 | New | Fertilizer | Italy
5 | Old | Fertilizer | Italy `
The problem is that sometimes, there are more than one Product_Number while it should be unique. What I am trying to do is from the ones that are in the dataframe more than once, to take the ones whose Condition is New without touching the rest. That gets to the result:
` Product_Number| Condition| Type | Country
1 | New | Chainsaw | USA
2 | Old | Tractor | India
3 | Null | Machete | Colombia
4 | New | Shovel | Brazil
5 | New | Fertilizer | Italy`
What I tried to do is first to see how many distinct product numbers I have:
df.select('Product_Number').distinct().count()
Then identify the product numbers that exist most than once and put them in a list:
numbers = df.select('Product_Number').groupBy('Product_Number').count().where('count > 1')\
.select('Product_Number').rdd.flatMap(lambda x: x).collect()
Then I am trying to filter out the product numbers that exist more than once and the Condition isn't new. By filtering them out, if it is done perfectly, counting it should give the same number as df.select('Product_Number').distinct().count().
The code that I have tried is:
1)df.filter(~(df.Product_Number.isin(numbers)) & ~((df.Condition == 'Old') | (df.Condition.isNull())))
df.filter(~((df.Product_Number.isin(numbers)) & ((df.Condition == 'Old') | (df.Condition.isNull()))))
df.filter(~(df.Product_Number.isin(numbers)) & (df.Condition == 'New'))
However, I haven't succeeded until now.