
I have a pyspark dataframe like this:

+---+-----+-----+
| id| name|state|
+---+-----+-----+
|111| null|   CT|
|222|name1|   CT|
|222|name2|   CT|
|333|name3|   CT|
|333|name4|   CT|
|333| null|   CT|
+---+-----+-----+

For a given id, I would like to keep the record even if "name" is null, as long as that id is not repeated. If the id is repeated, I would like to make sure the "name" column contains no duplicates within that id, and remove the rows where "name" is null only for those repeated ids. Below is the desired output:

+---+-----+-----+
| id| name|state|
+---+-----+-----+
|111| null|   CT|
|222|name1|   CT|
|222|name2|   CT|
|333|name3|   CT|
|333|name4|   CT|
+---+-----+-----+

How can I achieve this in PySpark?
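
For reference, the sample dataframe above can be created like this (just a sketch for testing; the column types are assumed to be strings):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; all columns assumed to be strings
df = spark.createDataFrame(
    [("111", None, "CT"),
     ("222", "name1", "CT"),
     ("222", "name2", "CT"),
     ("333", "name3", "CT"),
     ("333", "name4", "CT"),
     ("333", None, "CT")],
    ["id", "name", "state"])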

  • This might be what you are looking for python equivalent https://stackoverflow.com/questions/39982135/apache-spark-dealing-with-case-statements – tarun May 06 '20 at 06:26

2 Answers


You can do this by grouping on the id column and counting the number of names in each group. Null values are ignored by default in Spark, so any group with a count of 0 contains only null names and should be kept. We can then filter away the nulls in groups whose count is larger than 0.

In Scala this can be done with a window function as follows:

val w = Window.partitionBy("id")
val df2 = df.withColumn("gCount", count($"name").over(w))
  .filter($"name".isNotNull or $"gCount" === 0)
  .drop("gCount")

The PySpark equivalent:

w = Window.partitionBy("id")
df.withColumn("gCount", count("name").over(w))
  .filter((col("name").isNotNull()) | (col("gCount") == 0))
  .drop("gCount")

The above will not remove rows that have multiple nulls for the same id (all these will be kept).

If these should be removed as well, keeping only a single row with name==null, an easy way would be to use .dropDuplicates(['id','name']) before or after running the above code. Note that this will also remove any other rows that are duplicated on (id, name) (in which case .dropDuplicates(['id','name','state']) could be preferable).
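
For example, running it before the window filter could look like this (a sketch reusing w, col and count from the snippet above):

# One option: deduplicate on (id, name) first, then apply the same filter
df2 = (df.dropDuplicates(["id", "name"])  # collapses repeated (id, name) pairs, including repeated nulls
         .withColumn("gCount", count("name").over(w))
         .filter((col("name").isNotNull()) | (col("gCount") == 0))
         .drop("gCount"))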

Shaido
  • This fails if there is a duplicate name for one ID – Shubham Jain May 06 '20 at 06:53
  • @ShubhamJain: Yes, you are correct. This is a bit harder to take into account; the easiest way is to run `dropDuplicates()` before running the above code. Whether that is possible or not depends on the real case scenario. I added some information regarding this to the answer. – Shaido May 06 '20 at 07:37
  • .dropDuplicates(['id','name']) could be appended to your filter with the same results. It needn't be run beforehand, as near as I can tell. – born_naked May 06 '20 at 07:48

I think you can do that in two steps. First, count the rows for each id:

import pyspark.sql.functions as psf
import pyspark.sql.window as psw

w = psw.Window.partitionBy("id")
df = df.withColumn("n", psf.sum(psf.lit(1)).over(w))

Then filter out the rows where name is null and n > 1:

df = df.filter(~((psf.col('name').isNull()) & (psf.col('n') > 1)))

Edit

As mentioned by @Shubham Jain, if you have several null values for name within the same id (duplicates), the above filter will keep them. In that case, the solution proposed by @Shaido is useful: add a post treatment using .dropDuplicates(['id','name']), or .dropDuplicates(['id','name','state']), depending on your preference.
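
In code, the post treatment can simply be chained onto the filter (a sketch reusing the psf alias from above; dropping the helper column n is optional):

df = (df.filter(~((psf.col('name').isNull()) & (psf.col('n') > 1)))
        .dropDuplicates(['id', 'name'])  # also collapses repeated nulls within an id
        .drop('n'))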

linog