I am looking for a way to grow a cumulative value in a column in PySpark, using the lag function to first fetch the previous value in the column and then add to it. This fails, presumably because the new column can't reference itself before it exists. Is there a way around this?
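For reference, a minimal sketch of the lag-based pattern described above and why it fails (the names df, id, value, and cumul_sum are assumptions for illustration):

from pyspark.sql import functions as F
from pyspark.sql import Window as W

w = W.orderBy('id')
# Fails with an AnalysisException: 'cumul_sum' does not exist yet,
# so lag('cumul_sum') cannot resolve the column it is meant to read from.
df = df.withColumn('cumul_sum', F.lag('cumul_sum').over(w) + F.col('value'))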
- use something like rangeBetween from Window for cumulative sums: https://stackoverflow.com/a/45946350/9840637 – anky Mar 17 '22 at 05:52
- Please provide enough code so others can better understand or reproduce the problem. – Community Mar 17 '22 at 08:12
1 Answer
Maybe something like this is what you are looking for?
from pyspark.sql import functions as F
from pyspark.sql import Window as W

df = spark.createDataFrame(
    [
        ('1', 20),
        ('2', 34),
        ('3', 12)
    ], ['id', 'value'])

# Frame running from the start of the window up to the current row
w = W.orderBy('id').rowsBetween(W.unboundedPreceding, 0)

df\
    .withColumn('cumul_sum', F.sum(F.col('value')).over(w))\
    .show()
+---+-----+---------+
| id|value|cumul_sum|
+---+-----+---------+
| 1| 20| 20|
| 2| 34| 54|
| 3| 12| 66|
+---+-----+---------+
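A note on the frame choice: rowsBetween(W.unboundedPreceding, 0) sums every row from the start of the window up to and including the current row, which is what makes the sum cumulative. If the ordering column can contain ties and tied rows should receive the same cumulative value, a range-based frame (as the rangeBetween comment above suggests) is an alternative; a sketch, assuming the same df as above:

w_range = W.orderBy('id').rangeBetween(W.unboundedPreceding, 0)
df.withColumn('cumul_sum', F.sum('value').over(w_range)).show()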

Luiz Viola