
I have a pyspark dataframe like this:

+---+-----+-----+
| id| name|state|
+---+-----+-----+
|111| null|   CT|
|222|name1|   CT|
|222|name2|   CT|
|333|name3|   CT|
|333|name4|   CT|
|333| null|   CT|
+---+-----+-----+

For a given id, I would like to keep the record even if "name" is null, as long as that id is not repeated. If the id is repeated, I would like to make sure the "name" column contains no duplicates within that id, and remove the rows where "name" is null only for those repeated ids. Below is the desired output:

+---+-----+-----+
| id| name|state|
+---+-----+-----+
|111| null|   CT|
|222|name1|   CT|
|222|name2|   CT|
|333|name3|   CT|
|333|name4|   CT|
+---+-----+-----+

How can I achieve this in PySpark?
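
For reference, the sample dataframe above can be created like this (just a sketch for testing; the column types are assumed to be strings):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; all columns assumed to be strings
df = spark.createDataFrame(
    [("111", None, "CT"),
     ("222", "name1", "CT"),
     ("222", "name2", "CT"),
     ("333", "name3", "CT"),
     ("333", "name4", "CT"),
     ("333", None, "CT")],
    ["id", "name", "state"])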

  • This might be what you are looking for python equivalent https://stackoverflow.com/questions/39982135/apache-spark-dealing-with-case-statements – tarun May 06 '20 at 06:26

2 Answers


You can do this by grouping on the id column and counting the number of names in each group. Null values are ignored by default in Spark, so any group with a count of 0 contains only null names and should be kept. We can then filter away the nulls in groups whose count is larger than 0.

In Scala this can be done with a window function as follows:

val w = Window.partitionBy("id")
val df2 = df.withColumn("gCount", count($"name").over(w))
  .filter($"name".isNotNull or $"gCount" === 0)
  .drop("gCount")

The PySpark equivalent:

w = Window.partitionBy("id")
df.withColumn("gCount", count("name").over(w))
  .filter((col("name").isNotNull()) | (col("gCount") == 0))
  .drop("gCount")

The above will not remove rows that have multiple nulls for the same id (all these will be kept).

If these should be removed as well, keeping only a single row with name==null, an easy way would be to use .dropDuplicates(['id','name']) before or after running the above code. Note that this will also remove any other rows that are duplicated on (id, name) (in which case .dropDuplicates(['id','name','state']) could be preferable).
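
For example, running it before the window filter could look like this (a sketch reusing w, col and count from the snippet above):

# One option: deduplicate on (id, name) first, then apply the same filter
df2 = (df.dropDuplicates(["id", "name"])  # collapses repeated (id, name) pairs, including repeated nulls
         .withColumn("gCount", count("name").over(w))
         .filter((col("name").isNotNull()) | (col("gCount") == 0))
         .drop("gCount"))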

Shaido
  • This fails if there is a duplicate name for one ID – Shubham Jain May 06 '20 at 06:53
  • @ShubhamJain: Yes, you are correct. This is a bit harder to take into account; the easiest way is to run `dropDuplicates()` before running the above code. Whether that is possible or not depends on the real case scenario. I added some information regarding this to the answer. – Shaido May 06 '20 at 07:37
  • .dropDuplicates(['id','name']) could be appended to your filter with the same results. It needn't be run beforehand, as near as I can tell. – born_naked May 06 '20 at 07:48

I think you can do that in two steps. First, count the rows for each id:

import pyspark.sql.functions as psf
import pyspark.sql.window as psw

w = psw.Window.partitionBy("id")
df = df.withColumn("n", psf.sum(psf.lit(1)).over(w))

Then filter out the rows where name is null and n > 1:

df = df.filter(~((psf.col('name').isNull()) & (psf.col('n') > 1)))

Edit

As mentioned by @Shubham Jain, if you have several null values for name within the same id (duplicates), the above filter will keep them. In that case, the solution proposed by @Shaido is useful: add a post treatment using .dropDuplicates(['id','name']), or .dropDuplicates(['id','name','state']), depending on your preference.
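
In code, the post treatment can simply be chained onto the filter (a sketch reusing the psf alias from above; dropping the helper column n is optional):

df = (df.filter(~((psf.col('name').isNull()) & (psf.col('n') > 1)))
        .dropDuplicates(['id', 'name'])  # also collapses repeated nulls within an id
        .drop('n'))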

linog