Problem
I have a Spark DataFrame with a column that contains values only for some rows, not for every row (on a somewhat regular basis, e.g. only every 5 to 10 rows, based on the id).
Now, I would like to apply a window function to the rows containing values, involving the two previous and two following rows that also contain values (so basically pretending that all the rows containing nulls don't exist, i.e. don't count towards the rowsBetween range of the window). In practice, my effective window size could be arbitrary, depending on how many rows containing nulls exist. However, I always need exactly two values before and after. Also, the end result should contain all rows, because the other columns contain important information.
Example
For example, I want to calculate the sum over the previous two, the current, and the next two non-null values for every row in the following DataFrame whose value is not null:
from pyspark.sql.window import Window
import pyspark.sql.functions as F
from pyspark.sql import Row
df = spark.createDataFrame([Row(id=i, val=i * 2 if i % 5 == 0 else None, foo='other') for i in range(100)])
df.show()
Output:
+-----+---+----+
| foo| id| val|
+-----+---+----+
|other| 0| 0|
|other| 1|null|
|other| 2|null|
|other| 3|null|
|other| 4|null|
|other| 5| 10|
|other| 6|null|
|other| 7|null|
|other| 8|null|
|other| 9|null|
|other| 10| 20|
|other| 11|null|
|other| 12|null|
|other| 13|null|
|other| 14|null|
|other| 15| 30|
|other| 16|null|
|other| 17|null|
|other| 18|null|
|other| 19|null|
+-----+---+----+
If I just apply a window function to the DataFrame as is, there is no way to specify the condition that the rows in the frame must be non-null, so the frame consists almost entirely of nulls and the sum (which ignores nulls) simply equals the row's own value:
df2 = df.withColumn(
    'around_sum',
    F.when(
        F.col('val').isNotNull(),
        F.sum(F.col('val')).over(Window.orderBy('id').rowsBetween(-2, 2))
    ).otherwise(None)
)
df2.show()
Result:
+-----+---+----+----------+
| foo| id| val|around_sum|
+-----+---+----+----------+
|other| 0| 0| 0|
|other| 1|null| null|
|other| 2|null| null|
|other| 3|null| null|
|other| 4|null| null|
|other| 5| 10| 10|
|other| 6|null| null|
|other| 7|null| null|
|other| 8|null| null|
|other| 9|null| null|
|other| 10| 20| 20|
|other| 11|null| null|
|other| 12|null| null|
|other| 13|null| null|
|other| 14|null| null|
|other| 15| 30| 30|
|other| 16|null| null|
|other| 17|null| null|
|other| 18|null| null|
|other| 19|null| null|
+-----+---+----+----------+
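A quick way to see why this happens (just a sanity check, not part of the solution): collect_list skips nulls, so listing the frame contents shows that for every non-null row, the frame effectively contains only that row's own value:
w = Window.orderBy('id').rowsBetween(-2, 2)
# For id=5 the frame covers ids 3..7, but the only non-null value in it is 10
df.withColumn('frame', F.collect_list('val').over(w))\
    .where(F.col('val').isNotNull()).show()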
I was able to achieve the desired result by creating a second DataFrame that contains only the rows where the value is not null, doing the window operation there, and joining the result back afterwards:
df3 = df.where(F.col('val').isNotNull())\
    .withColumn('around_sum', F.sum(F.col('val')).over(Window.orderBy('id').rowsBetween(-2, 2)))\
    .select(F.col('around_sum'), F.col('id').alias('id2'))
df3 = df.join(df3, F.col('id') == F.col('id2'), 'outer').orderBy(F.col('id')).drop('id2')
df3.show()
Result:
+-----+---+----+----------+
| foo| id| val|around_sum|
+-----+---+----+----------+
|other| 0| 0| 30|
|other| 1|null| null|
|other| 2|null| null|
|other| 3|null| null|
|other| 4|null| null|
|other| 5| 10| 60|
|other| 6|null| null|
|other| 7|null| null|
|other| 8|null| null|
|other| 9|null| null|
|other| 10| 20| 100|
|other| 11|null| null|
|other| 12|null| null|
|other| 13|null| null|
|other| 14|null| null|
|other| 15| 30| 150|
|other| 16|null| null|
|other| 17|null| null|
|other| 18|null| null|
|other| 19|null| null|
+-----+---+----+----------+
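For reference, the same workaround can be written a bit more compactly (assuming the id column is unique, a plain left join on it avoids the id2 alias and the extra drop):
# Compute the window only over the non-null rows, then join back on id
non_null = df.where(F.col('val').isNotNull())\
    .withColumn('around_sum', F.sum(F.col('val')).over(Window.orderBy('id').rowsBetween(-2, 2)))\
    .select('id', 'around_sum')
df.join(non_null, on='id', how='left').orderBy('id').show()
Either way, this still builds a second DataFrame and joins it back, which is exactly what I would like to avoid.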
Question
Now I am wondering whether I can somehow get rid of the join (and the second DataFrame) and instead specify the condition directly in the window function.
Is this possible?