I am looking for a way to grow a cumulative value in a column in PySpark, using the lag function to first fetch the previous value in the column and then add to it. This fails, presumably because the new column can't reference itself before it exists. Is there a way around this?
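For reference, a minimal sketch of the lag-based pattern described above and why it fails (the names df, id, value, and cumul_sum are assumptions for illustration):

from pyspark.sql import functions as F
from pyspark.sql import Window as W

w = W.orderBy('id')
# Fails with an AnalysisException: 'cumul_sum' does not exist yet,
# so lag('cumul_sum') cannot resolve the column it is meant to read from.
df = df.withColumn('cumul_sum', F.lag('cumul_sum').over(w) + F.col('value'))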
- use something like rangeBetween from Window for cumulative sums: https://stackoverflow.com/a/45946350/9840637 – anky Mar 17 '22 at 05:52
- Please provide enough code so others can better understand or reproduce the problem. – Community Mar 17 '22 at 08:12
1 Answer
Maybe something like this is what you are looking for?
from pyspark.sql import functions as F
from pyspark.sql import Window as W

df = spark.createDataFrame(
    [
        ('1', 20),
        ('2', 34),
        ('3', 12)
    ], ['id', 'value'])

# Frame running from the start of the window up to the current row
w = W.orderBy('id').rowsBetween(W.unboundedPreceding, 0)

df\
    .withColumn('cumul_sum', F.sum(F.col('value')).over(w))\
    .show()
+---+-----+---------+
| id|value|cumul_sum|
+---+-----+---------+
| 1| 20| 20|
| 2| 34| 54|
| 3| 12| 66|
+---+-----+---------+
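A note on the frame choice: rowsBetween(W.unboundedPreceding, 0) sums every row from the start of the window up to and including the current row, which is what makes the sum cumulative. If the ordering column can contain ties and tied rows should receive the same cumulative value, a range-based frame (as the rangeBetween comment above suggests) is an alternative; a sketch, assuming the same df as above:

w_range = W.orderBy('id').rangeBetween(W.unboundedPreceding, 0)
df.withColumn('cumul_sum', F.sum('value').over(w_range)).show()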

Luiz Viola