
I am working on a Spark Streaming project where I have to calculate the cumulative sum of one column of a DataFrame. I have successfully calculated the cumulative sum using this link, but Spark only calculates the sum within each batch; for the next batch it starts from scratch. I need to apply the logic across the previous and upcoming batches. How can I store the upcoming data, or remember the previous Spark batch, so that the cumulative sum carries over?

output of batch 1

+---------------+-------------------+----+----+----+
|     product_id|          date_time| ack|val1|val2|
+---------------+-------------------+----+----+----+
|4008607333T.upf|2017-12-13:02:27:01|3-46|  50|  50|
|4008607333T.upf|2017-12-13:02:27:03|3-46|  60| 110|
+---------------+-------------------+----+----+----+
output of batch 2
+---------------+-------------------+----+----+----+
|     product_id|          date_time| ack|val1|val2|
+---------------+-------------------+----+----+----+
|4008607333T.upf|2017-12-13:03:27:01|3-46|  30|  30|
|4008607333T.upf|2017-12-13:03:27:03|3-46|  20|  50|
+---------------+-------------------+----+----+----+

it should be (with the cumulative sum carried over from batch 1):

output of batch 2
+---------------+-------------------+----+----+----+
|     product_id|          date_time| ack|val1|val2|
+---------------+-------------------+----+----+----+
|4008607333T.upf|2017-12-13:03:27:01|3-46|  30| 140|
|4008607333T.upf|2017-12-13:03:27:03|3-46|  20| 160|
+---------------+-------------------+----+----+----+

Spark code

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Running total per (product_id, ack), ordered by date_time,
// from the first row of the partition up to the current row
val w = Window.partitionBy($"product_id", $"ack")
  .orderBy($"date_time")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val newDf = inputDF.withColumn("val_sum", sum($"val1").over(w))
  .withColumn("val2_sum", sum($"val2").over(w))
– lucy
  • I don't have an opportunity to try it out right now, however, I believe `updateStateByKey()` could be what you are looking for, see https://docs.cloud.databricks.com/docs/latest/databricks_guide/07 Spark Streaming/11 Global Aggregations - updateStateByKey.html (yes, all that is the link, including spaces....) – Shaido Dec 20 '17 at 15:03
  • Shaido, with updateStateByKey() I can update the value based on a key, but in my case the key is (product_id, date_time, ack). Can you please help me? – lucy Dec 20 '17 at 15:39
  • Can you add the code here? I'm guessing the key can be a tuple, try using `(product_id, date_time, ack)` as key. – Shaido Dec 20 '17 at 15:42
  • here is the example. http://amithora.com/spark-update-by-key-explained/ – lucy Dec 20 '17 at 16:02
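
As a starting point, here is a minimal sketch of the updateStateByKey() idea suggested in the comments above (the StreamingContext ssc, the keyed DStream keyedStream, and the checkpoint directory are hypothetical placeholders). The state keeps one running total of val1 per (product_id, ack) key, so it survives across batches:

// `ssc` is assumed to be an existing StreamingContext.
// Stateful DStream operations require checkpointing (directory is a placeholder).
ssc.checkpoint("/tmp/cumulative-sum-checkpoint")

// keyedStream: DStream[((String, String), Long)] keyed by (product_id, ack),
// assumed to be built from the parsed input records
val runningTotal = keyedStream.updateStateByKey[Long] {
  (newValues: Seq[Long], state: Option[Long]) =>
    // add this batch's values to the total carried over from previous batches
    Some(state.getOrElse(0L) + newValues.sum)
}

// Each batch now emits the cumulative val1 per (product_id, ack)
runningTotal.print()

Note that this keeps only one running total per key; to reproduce the per-row val2 column you would still compute the within-batch window sum and add the carried-over total to it.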

0 Answers