
Given column A as shown in the following example, I'd like to compute column B, where each record is the sum of the current record in A and the previous record in B:

+---+---+
| A | B |
+---+---+
| 0 | 0 |
| 0 | 0 |
| 1 | 1 |
| 0 | 1 |
| 1 | 2 |
| 1 | 3 |
| 0 | 3 |
| 0 | 3 |
+---+---+

So, in a way, I'm interested in taking the previous record into account in my operation. I'm aware of the `F.lag` function, but I don't see how it can be applied here. Any ideas on how to get this done?

I'm open to rephrasing if the idea can be expressed in a better way.

Vzzarr

1 Answer


It seems you're trying to compute a rolling (cumulative) sum of A. You can apply a sum over a window, e.g.

from pyspark.sql import functions as F, Window

df2 = df.withColumn('B', F.sum('A').over(Window.orderBy('ordering_col')))

But you would need a column to order by; otherwise the "previous record" is not well-defined, because Spark DataFrames are unordered.

mck