
Given column A as shown in the following example, I'd like to compute column B, where each record is the sum of the current record in A and the previous record in B:

+---+---+
| A | B |
+---+---+
| 0 | 0 |
| 0 | 0 |
| 1 | 1 |
| 0 | 1 |
| 1 | 2 |
| 1 | 3 |
| 0 | 3 |
| 0 | 3 |
+---+---+

So, in a way, I'm interested in taking the previous record into account in my operation. I'm aware of the `F.lag` function, but I don't see how it can be applied here. Any ideas on how to get this done?

I'm open to rephrasing if the idea can be expressed in a better way.

Vzzarr

1 Answer


It seems you're trying to compute a rolling (cumulative) sum of A. You can apply a sum over a window, e.g.

from pyspark.sql import functions as F, Window

df2 = df.withColumn('B', F.sum('A').over(Window.orderBy('ordering_col')))

But you would need a column to order by; otherwise the "previous record" is not well-defined, because Spark DataFrames are unordered.

mck