
Let's say I have the following dataframe, which is sorted for visual clarity:

[image: the input dataframe with Period, Month, and Cust columns]

How would I use window functions to create a new column that sums the Cust values of the previous and current rows, ordered by the Month column within each Period partition:

[image: the desired output with the new CustLast2 column]

The following is my attempt at it, but I'm obviously doing something wrong with respect to the rowsBetween function.

df = df.withColumn('CustLast2', sum('Cust').over(Window.partitionBy("period").orderBy('Month').rowsBetween(Window.currentRow, -1)))
– ben890

4 Answers


What you want is to sum the last 2 rows (current row included), so simply specify the rowsBetween like this:

from pyspark.sql import functions as F, Window

w = Window.partitionBy('Period').orderBy('Month').rowsBetween(-1, Window.currentRow)

df = df.withColumn('CustLast2', F.sum('Cust').over(w))

You inverted the lower and upper bounds of the window frame in your attempt.
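
A minimal runnable sketch of this fix, using made-up Period/Month/Cust values since the question's actual data isn't shown:

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for the question's dataframe; the values are illustrative only
df = spark.createDataFrame(
    [("A", 1, 10), ("A", 2, 20), ("A", 3, 30), ("B", 1, 5), ("B", 2, 15)],
    ["Period", "Month", "Cust"],
)

# The frame covers the previous row and the current row within each Period
w = Window.partitionBy("Period").orderBy("Month").rowsBetween(-1, Window.currentRow)
df.withColumn("CustLast2", F.sum("Cust").over(w)).show()
# Period A -> 10, 30, 50; Period B -> 5, 20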

– blackbishop

I think you're almost there; you just have to replace -1 with Window.unboundedPreceding:

df = df.withColumn('CustLast2', sum('Cust').over(Window.partitionBy("period").orderBy('Month').rowsBetween(Window.unboundedPreceding, Window.currentRow)))

Otherwise you're just doing the sum on 2 consecutive rows within the same period.
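
If you want the frame to cover only the last N rows rather than the whole partition, here is a hedged sketch (the helper name and the value of N are illustrative, not from the question):

from pyspark.sql import functions as F, Window

# Illustrative helper: sum over the last `n` rows, current row included
def sum_last_n(df, n, value_col="Cust", part_col="period", order_col="Month"):
    w = (Window.partitionBy(part_col)
               .orderBy(order_col)
               .rowsBetween(-(n - 1), Window.currentRow))
    return df.withColumn(f"{value_col}Last{n}", F.sum(value_col).over(w))

# e.g. a 6-row look-back: df = sum_last_n(df, 6)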

– Yoan B. M.Sc
  • Is there a way to generalize this? If I wanted the window to be 6 or 7 or any arbitrary number, how could I do this? – ben890 Feb 08 '22 at 22:15
  • @ben890, that's what `Window.unboundedPreceding` is for; it looks back over the whole current partition. – Yoan B. M.Sc Feb 09 '22 at 00:25
  • This won't even execute. You can't specify the boundary end of the frame as `Window.unboundedPreceding` – blackbishop Feb 09 '22 at 16:51
  • @blackbishop, I'm using it in many pipelines so you can, but it might not be what you need though. https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.Window.rowsBetween.html – Yoan B. M.Sc Feb 09 '22 at 18:52
  • @YoanB.M.Sc Can you show complete example using `.rowsBetween(Window.currentRow, Window.unboundedPreceding)`? I'm curious.. Maybe you mean `Window.unboundedFollowing` for the upper bound? – blackbishop Feb 09 '22 at 19:04
  • @blackbishop, apologies, I see what you mean; the arguments were inverted. I've edited the post. `Window.unboundedFollowing` won't give you the same result though. – Yoan B. M.Sc Feb 09 '22 at 19:41

We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic very readable by expressing it in native Python. Fugue can then port it to Spark for you with one function call.

First we start with a test Pandas DataFrame (we'll port to Spark later):

import pandas as pd
df = pd.DataFrame({"date": ["2020-01-01", "2020-01-02", "2020-01-03"] * 3, 
                   "period": [0,0,0,1,1,1,2,2,2],
                   "val": [4,5,2] * 3})

Then we make a Pandas-based function. Notice it is meant to be applied per group; we will handle the partitioning later.

def rolling(df: pd.DataFrame) -> pd.DataFrame:
    df["cum_sum"] = df["val"].rolling(2).sum().fillna(df["val"])
    return df

Now we can use the Fugue transform function to test on Pandas. It also handles the partitioning and the presort.

from fugue import transform
transform(df, rolling, schema="*, cum_sum:float", partition={"by":"period", "presort": "date asc"})

Because this works, we can bring it to Spark just by specifying the engine:

import fugue_spark
transform(df, rolling, schema="*, cum_sum:float", partition={"by":"period", "presort": "date asc"}, engine="spark").show()
+----------+------+---+-------+
|      date|period|val|cum_sum|
+----------+------+---+-------+
|2020-01-01|     0|  4|    4.0|
|2020-01-02|     0|  5|    9.0|
|2020-01-03|     0|  2|    7.0|
|2020-01-01|     1|  4|    4.0|
|2020-01-02|     1|  5|    9.0|
|2020-01-03|     1|  2|    7.0|
|2020-01-01|     2|  4|    4.0|
|2020-01-02|     2|  5|    9.0|
|2020-01-03|     2|  2|    7.0|
+----------+------+---+-------+

Notice you need .show() now because of Spark's lazy evaluation. The Fugue transform function can take in both Pandas and Spark DataFrames, and it will output a Spark DataFrame when run on the Spark engine.
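
As a hedged illustration of that (assuming an active SparkSession is available), the same call also works when the input is already a Spark DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)  # df is the Pandas frame built above

# Same transform call as before, but fed a Spark DataFrame
result = transform(sdf, rolling, schema="*, cum_sum:float",
                   partition={"by": "period", "presort": "date asc"},
                   engine="spark")
result.show()  # result is a Spark DataFrame, so an action is needed to see it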

– Kevin Kho

You can solve this problem using the code below:

from pyspark.sql import functions as F, Window as W

# lag() pulls the previous Cust within each Period; the null on the first row is replaced with 0
(df.withColumn('last_value', F.lag(F.col('Cust')).over(W.partitionBy(['Period']).orderBy(F.col('Month'))))
   .withColumn('last_value', F.when(F.col('last_value').isNull(), 0).otherwise(F.col('last_value')))
   .withColumn('cumSum', F.col('Cust') + F.col('last_value')))
– danimille