
I am working on a Spark Streaming project where I have to calculate the cumulative sum of one column of a DataFrame. I have successfully calculated the cumulative sum using this link, but Spark only calculates the sum within each batch; for the next batch it starts from scratch. I need to apply the logic across the previous and upcoming batches. How can I store the upcoming data, or remember the previous Spark batch, so that the cumulative sum carries over?

output of batch 1

+---------------+-------------------+----+----+----+
|     product_id|          date_time| ack|val1|val2|
+---------------+-------------------+----+----+----+
|4008607333T.upf|2017-12-13:02:27:01|3-46|  50|  50|
|4008607333T.upf|2017-12-13:02:27:03|3-46|  60| 110|
+---------------+-------------------+----+----+----+
output of batch 2
+---------------+-------------------+----+----+----+
|     product_id|          date_time| ack|val1|val2|
+---------------+-------------------+----+----+----+
|4008607333T.upf|2017-12-13:03:27:01|3-46|  30|  30|
|4008607333T.upf|2017-12-13:03:27:03|3-46|  20|  50|
+---------------+-------------------+----+----+----+

it should be (with the cumulative sum carried over from batch 1):

output of batch 2
+---------------+-------------------+----+----+----+
|     product_id|          date_time| ack|val1|val2|
+---------------+-------------------+----+----+----+
|4008607333T.upf|2017-12-13:03:27:01|3-46|  30| 140|
|4008607333T.upf|2017-12-13:03:27:03|3-46|  20| 160|
+---------------+-------------------+----+----+----+

Spark code

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Running total per (product_id, ack), ordered by date_time,
// from the first row of the partition up to the current row
val w = Window.partitionBy($"product_id", $"ack")
  .orderBy($"date_time")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val newDf = inputDF.withColumn("val_sum", sum($"val1").over(w))
  .withColumn("val2_sum", sum($"val2").over(w))
– lucy
  • I don't have an opportunity to try it out right now, however, I believe `updateStateByKey()` could be what you are looking for, see https://docs.cloud.databricks.com/docs/latest/databricks_guide/07 Spark Streaming/11 Global Aggregations - updateStateByKey.html (yes, all that is the link, including spaces....) – Shaido Dec 20 '17 at 15:03
  • Shaido, with updateStateByKey() I can update the value based on a key, but in my case the key is (product_id, date_time, ack). Can you please help me? – lucy Dec 20 '17 at 15:39
  • Can you add the code here? I'm guessing the key can be a tuple, try using `(product_id, date_time, ack)` as key. – Shaido Dec 20 '17 at 15:42
  • here is the example. http://amithora.com/spark-update-by-key-explained/ – lucy Dec 20 '17 at 16:02
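
As a starting point, here is a minimal sketch of the updateStateByKey() idea suggested in the comments above (the StreamingContext ssc, the keyed DStream keyedStream, and the checkpoint directory are hypothetical placeholders). The state keeps one running total of val1 per (product_id, ack) key, so it survives across batches:

// `ssc` is assumed to be an existing StreamingContext.
// Stateful DStream operations require checkpointing (directory is a placeholder).
ssc.checkpoint("/tmp/cumulative-sum-checkpoint")

// keyedStream: DStream[((String, String), Long)] keyed by (product_id, ack),
// assumed to be built from the parsed input records
val runningTotal = keyedStream.updateStateByKey[Long] {
  (newValues: Seq[Long], state: Option[Long]) =>
    // add this batch's values to the total carried over from previous batches
    Some(state.getOrElse(0L) + newValues.sum)
}

// Each batch now emits the cumulative val1 per (product_id, ack)
runningTotal.print()

Note that this keeps only one running total per key; to reproduce the per-row val2 column you would still compute the within-batch window sum and add the carried-over total to it.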

0 Answers