
Let's say I have a pyspark DF:

| Column A | Column B |
| -------- | -------- |
| val1     | val1B    |
| null     | val2B    |
| val2     | null     |
| val3     | val3B    |

Can someone help me with replacing any null value in any column (for the whole df) with the value right below it? So the final table should look like this:

| Column A | Column B |
| -------- | -------- |
| val1     | val1B    |
| val2     | val2B    |
| val2     | val3B    |
| val3     | val3B    |

How could this be done? Can I get a code demo if possible? Thank you!

All I've really gotten through is numbering the rows and building a condition to find the rows that contain null values. So I'm left with a table like this:

| Column A | Column B | row_num |
| -------- | -------- | ------- |
| null     | val2B    | 2       |
| val2     | null     | 3       |

But I don't think this step is needed. I'm stuck as to what to do.
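(For reference, a minimal sketch of how a table like the one above could be built; ordering by `monotonically_increasing_id` is an assumption here, since Spark DataFrames have no inherent row order:)

from functools import reduce
from pyspark.sql import Window
from pyspark.sql.functions import col, monotonically_increasing_id, row_number

# Number the rows in their current order (monotonically_increasing_id
# is an assumed ordering key).
w = Window.orderBy(monotonically_increasing_id())
numbered = df.withColumn("row_num", row_number().over(w))

# Keep only the rows where any column is null.
any_null = reduce(lambda a, b: a | b, [col(c).isNull() for c in df.columns])
numbered.filter(any_null).show()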

• Usually null values are replaced with default values, or pre-computed values like the average or median. You can fill null values using the [fillna](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.fillna.html) method. – greenie Jan 06 '23 at 21:27
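(A minimal sketch of that fillna approach, for comparison; note it fills fixed per-column defaults rather than the value below, and the replacement values here are made up purely for illustration:)

# Fill nulls with fixed per-column defaults via DataFrame.fillna.
# The "UNKNOWN" placeholders are illustrative assumptions.
df.fillna({"Column A": "UNKNOWN", "Column B": "UNKNOWN"}).show()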

1 Answer


Use a list comprehension to coalesce each column with the lead window function. Code below:

from pyspark.sql import Window
from pyspark.sql.functions import coalesce, col, lead, monotonically_increasing_id

# Coalesce each column with lead() so a null takes the value of the row below it.
w = Window.partitionBy().orderBy(monotonically_increasing_id())
df.select(*[coalesce(col(x), lead(x).over(w)).alias(x) for x in df.columns]).show()


+--------+--------+
|Column A|Column B|
+--------+--------+
|    val1|   val1B|
|    val2|   val2B|
|    val2|   val3B|
|    val3|   val3B|
+--------+--------+
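Note that lead only looks one row down, so a null is filled only when the row immediately below it is non-null (which is enough for your sample data). If several consecutive nulls could occur, a variant using first with ignorenulls=True over an unbounded following frame pulls the nearest non-null value at or below the current row; a sketch under that assumption:

from pyspark.sql import Window
from pyspark.sql.functions import first, monotonically_increasing_id

# Take the first non-null value from the current row downward,
# so runs of consecutive nulls are also filled.
w = (Window.partitionBy()
     .orderBy(monotonically_increasing_id())
     .rowsBetween(Window.currentRow, Window.unboundedFollowing))
df.select(*[first(x, ignorenulls=True).over(w).alias(x) for x in df.columns]).show()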