
Problem

I have a Spark DataFrame with a column that contains values only for some rows, on a somewhat regular basis (e.g. only every 5 to 10 rows, based on the id).

Now I would like to apply a window function to the rows containing values, involving the two previous and the two following rows that also contain values (essentially pretending that the null rows don't exist, i.e. that they don't count towards the rowsBetween range of the window). In practice, the effective physical window size is arbitrary, depending on how many null rows lie in between, but I always need exactly two values before and after. Also, the end result should still contain all rows, because the other columns carry important information.
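To make the intended "skip the nulls" semantics concrete, here is the computation in plain Python (no Spark); `around_sums` is just an illustrative helper, not part of any API:

```python
# Plain-Python sketch of the intended computation: for each non-null
# entry, sum the entry itself plus the two previous and the two
# following non-null entries, ignoring null rows entirely.
def around_sums(vals, before=2, after=2):
    idx = [i for i, v in enumerate(vals) if v is not None]  # positions of non-null values
    out = [None] * len(vals)
    for k, i in enumerate(idx):
        neighbours = idx[max(0, k - before):k + after + 1]  # up to 2 non-null rows each side
        out[i] = sum(vals[j] for j in neighbours)
    return out

# Same data as the example DataFrame below: a value every 5 rows.
vals = [i * 2 if i % 5 == 0 else None for i in range(100)]
result = around_sums(vals)
```

For id 10, for instance, this is 0 + 10 + 20 + 30 + 40 = 100, which matches the desired result shown further down.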

Example

For example, I want to calculate the sum over the previous two, the current and the next two non-null values, for the non-null rows of the following DataFrame:

from pyspark.sql.window import Window
import pyspark.sql.functions as F
from pyspark.sql import Row

df = spark.createDataFrame([Row(id=i, val=i * 2 if i % 5 == 0 else None, foo='other') for i in range(100)])
df.show()

Output:

+-----+---+----+
|  foo| id| val|
+-----+---+----+
|other|  0|   0|
|other|  1|null|
|other|  2|null|
|other|  3|null|
|other|  4|null|
|other|  5|  10|
|other|  6|null|
|other|  7|null|
|other|  8|null|
|other|  9|null|
|other| 10|  20|
|other| 11|null|
|other| 12|null|
|other| 13|null|
|other| 14|null|
|other| 15|  30|
|other| 16|null|
|other| 17|null|
|other| 18|null|
|other| 19|null|
+-----+---+----+

If I just use a window function over the DataFrame as is, I can't tell the frame to skip nulls: for every non-null row the ±2-row frame contains only nulls, so (since F.sum ignores nulls) the sum equals the row's own value:

w = Window.orderBy('id').rowsBetween(-2, 2)
df2 = df.withColumn('around_sum',
                    F.when(F.col('val').isNotNull(), F.sum('val').over(w)))
df2.show()

Result:

+-----+---+----+----------+
|  foo| id| val|around_sum|
+-----+---+----+----------+
|other|  0|   0|         0|
|other|  1|null|      null|
|other|  2|null|      null|
|other|  3|null|      null|
|other|  4|null|      null|
|other|  5|  10|        10|
|other|  6|null|      null|
|other|  7|null|      null|
|other|  8|null|      null|
|other|  9|null|      null|
|other| 10|  20|        20|
|other| 11|null|      null|
|other| 12|null|      null|
|other| 13|null|      null|
|other| 14|null|      null|
|other| 15|  30|        30|
|other| 16|null|      null|
|other| 17|null|      null|
|other| 18|null|      null|
|other| 19|null|      null|
+-----+---+----+----------+

I was able to achieve the desired result by creating a second DataFrame containing only the rows with non-null values, applying the window operation there, and joining the result back afterwards:

w = Window.orderBy('id').rowsBetween(-2, 2)
df3 = (df.where(F.col('val').isNotNull())
         .withColumn('around_sum', F.sum('val').over(w))
         .select('id', 'around_sum'))
df3 = df.join(df3, on='id', how='outer').orderBy('id')
df3.show()

Result:

+-----+---+----+----------+
|  foo| id| val|around_sum|
+-----+---+----+----------+
|other|  0|   0|        30|
|other|  1|null|      null|
|other|  2|null|      null|
|other|  3|null|      null|
|other|  4|null|      null|
|other|  5|  10|        60|
|other|  6|null|      null|
|other|  7|null|      null|
|other|  8|null|      null|
|other|  9|null|      null|
|other| 10|  20|       100|
|other| 11|null|      null|
|other| 12|null|      null|
|other| 13|null|      null|
|other| 14|null|      null|
|other| 15|  30|       150|
|other| 16|null|      null|
|other| 17|null|      null|
|other| 18|null|      null|
|other| 19|null|      null|
+-----+---+----+----------+

Question

Now I am wondering whether I can somehow get rid of the join (and the second DataFrame) and instead specify the not-null condition in the window function directly.

Is this possible?

Matthias
  • Did you consider `fillna()`? – Bala Nov 20 '18 at 17:44
  • @Bala: Not sure how I could use fillna() for this purpose - I don't really want to fill the null values. I also looked at using last/first, but I haven't found a solution yet which can handle more than +/- one value "around" (I need two in this case). – Matthias Nov 21 '18 at 11:29

1 Answer


A good solution is to start by filling the nulls with 0 and then perform the operations. Apply the fillna only to the column involved, like this:

df = df.fillna(0, subset=['val'])

If you are not sure whether you want to get rid of the nulls, copy the column, compute the window over the copy, and drop the copy after the operation.

Like this:

df = df.withColumn('val2', F.col('val'))
df = df.fillna(0, subset=['val2'])
# Then perform the operations over val2.
w = Window.orderBy(F.col('id')).rowsBetween(-2, 2)
df = df.withColumn('around_sum', F.sum(F.col('val2')).over(w))
# After the operations, get rid of the copy column.
df = df.drop('val2')
Manrique
  • Thanks for the response! The result looks very different from my expected result, however: when I run it, around_sum == val for my example DataFrame. – Matthias Nov 27 '18 at 14:04