I am working on a Spark Streaming project where I have to calculate the cumulative sum of one column of a DataFrame. I have successfully calculated the cumulative sum using this link, but Spark only computes the sum within each batch; the next batch starts from scratch. I need the logic to span both the previous and the upcoming batches. How can I store the incoming data, or remember the previous batch, so that the cumulative sum continues across batches? (A possible stateful approach is sketched after the code below.)
output of 1 batch
+---------------+-------------------+----+----+----+
|     product_id|          date_time| ack|val1|val2|
+---------------+-------------------+----+----+----+
|4008607333T.upf|2017-12-13:02:27:01|3-46|  50|  50|
|4008607333T.upf|2017-12-13:02:27:03|3-46|  60| 110|
+---------------+-------------------+----+----+----+
output of 2 batch
+---------------+-------------------+----+----+----+
|     product_id|          date_time| ack|val1|val2|
+---------------+-------------------+----+----+----+
|4008607333T.upf|2017-12-13:03:27:01|3-46|  30|  30|
|4008607333T.upf|2017-12-13:03:27:03|3-46|  20|  50|
+---------------+-------------------+----+----+----+
it should be (val2 should carry on from the 110 reached at the end of batch 1: 110 + 30 = 140, then 140 + 20 = 160)
output of 2 batch
+---------------+-------------------+----+----+----+
|     product_id|          date_time| ack|val1|val2|
+---------------+-------------------+----+----+----+
|4008607333T.upf|2017-12-13:03:27:01|3-46|  30| 140|
|4008607333T.upf|2017-12-13:03:27:03|3-46|  20| 160|
+---------------+-------------------+----+----+----+
Spark code
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._   // for the $"col" syntax
// cumulative sum per (product_id, ack), ordered by time -- recomputed from zero in every batch
val w = Window.partitionBy($"product_id", $"ack")
  .orderBy($"date_time")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val newDf = inputDF.withColumn("val_sum", sum($"val1").over(w))
  .withColumn("val2_sum", sum($"val2").over(w))
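The window above is evaluated independently on each micro-batch, so by itself it cannot remember the totals from earlier batches. One way to carry the running total across batches, assuming the job can use Structured Streaming's typed API, is to keep per-key state with groupByKey plus flatMapGroupsWithState. The sketch below only illustrates that idea; the case classes Reading and Cumulated, the dataset inputDS, and the column handling are assumptions, not part of the original code.
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import spark.implicits._   // assumes a SparkSession named spark

// hypothetical row shapes -- adjust to the real schema
case class Reading(product_id: String, date_time: String, ack: String, val1: Long)
case class Cumulated(product_id: String, date_time: String, ack: String, val1: Long, val2: Long)

// inputDS: Dataset[Reading] (assumed); the state is the last cumulative total per (product_id, ack)
val cumulated = inputDS
  .groupByKey(r => (r.product_id, r.ack))
  .flatMapGroupsWithState[Long, Cumulated](OutputMode.Update, GroupStateTimeout.NoTimeout) {
    (key, rows, state) =>
      var total = state.getOption.getOrElse(0L)            // total carried over from previous batches
      val out = rows.toSeq.sortBy(_.date_time).map { r =>
        total += r.val1
        Cumulated(r.product_id, r.date_time, r.ack, r.val1, total)
      }
      state.update(total)                                   // remember the total for the next batch
      out.iterator
  }
If the query is started with writeStream in update output mode and a checkpointLocation, this per-key state also survives restarts of the streaming job.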