
Let's say I have a pyspark DF:

| Column A | Column B |
| -------- | -------- |
| val1     | val1B    |
| null     | val2B    |
| val2     | null     |
| val3     | val3B    |

Can someone help me with replacing any null value in any column (for the whole df) with the value right below it? So the final table should look like this:

| Column A | Column B |
| -------- | -------- |
| val1     | val1B    |
| val2     | val2B    |
| val2     | val3B    |
| val3     | val3B    |

How could this be done? Can I get a code demo if possible? Thank you!

All I've really gotten through is numbering the rows and building a condition to find the rows that contain null values. So I'm left with a table like this:

| Column A | Column B | row_num |
| -------- | -------- | ------- |
| null     | val2B    | 2       |
| val2     | null     | 3       |

But I don't think this step is needed. I'm stuck as to what to do.
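(For reference, a minimal sketch of how a table like the one above could be built; ordering by `monotonically_increasing_id` is an assumption here, since Spark DataFrames have no inherent row order:)

from functools import reduce
from pyspark.sql import Window
from pyspark.sql.functions import col, monotonically_increasing_id, row_number

# Number the rows in their current order (monotonically_increasing_id
# is an assumed ordering key).
w = Window.orderBy(monotonically_increasing_id())
numbered = df.withColumn("row_num", row_number().over(w))

# Keep only the rows where any column is null.
any_null = reduce(lambda a, b: a | b, [col(c).isNull() for c in df.columns])
numbered.filter(any_null).show()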

• Usually null values are replaced with default values, or pre-computed values like the average or median. You can fill null values using the [fillna](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.fillna.html) method. – greenie Jan 06 '23 at 21:27
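(A minimal sketch of that fillna approach, for comparison; note it fills fixed per-column defaults rather than the value below, and the replacement values here are made up purely for illustration:)

# Fill nulls with fixed per-column defaults via DataFrame.fillna.
# The "UNKNOWN" placeholders are illustrative assumptions.
df.fillna({"Column A": "UNKNOWN", "Column B": "UNKNOWN"}).show()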

1 Answer


Use a list comprehension to coalesce each column with the lead window function. Code below:

from pyspark.sql import Window
from pyspark.sql.functions import coalesce, col, lead, monotonically_increasing_id

# Coalesce each column with lead() so a null takes the value of the row below it.
w = Window.partitionBy().orderBy(monotonically_increasing_id())
df.select(*[coalesce(col(x), lead(x).over(w)).alias(x) for x in df.columns]).show()


+--------+--------+
|Column A|Column B|
+--------+--------+
|    val1|   val1B|
|    val2|   val2B|
|    val2|   val3B|
|    val3|   val3B|
+--------+--------+
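Note that lead only looks one row down, so a null is filled only when the row immediately below it is non-null (which is enough for your sample data). If several consecutive nulls could occur, a variant using first with ignorenulls=True over an unbounded following frame pulls the nearest non-null value at or below the current row; a sketch under that assumption:

from pyspark.sql import Window
from pyspark.sql.functions import first, monotonically_increasing_id

# Take the first non-null value from the current row downward,
# so runs of consecutive nulls are also filled.
w = (Window.partitionBy()
     .orderBy(monotonically_increasing_id())
     .rowsBetween(Window.currentRow, Window.unboundedFollowing))
df.select(*[first(x, ignorenulls=True).over(w).alias(x) for x in df.columns]).show()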