
Is there a way to perform some action at the end of each micro-batch of a DStream in Spark Streaming? My aim is to count the number of events processed by Spark. Spark Streaming does give me some numbers, but the average also seems to include zero values (since some micro-batches are empty).

For example, I collect some statistics and want to send them to my server, but the object that collects the data only exists during a given batch and is initialized from scratch for the next batch. I would love to be able to call my "finish" method before the batch is done and the object is gone; otherwise I lose the data that has not yet been sent to my server.
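For illustration, a rough sketch of the kind of thing I mean (the socket source and the counting logic are just placeholders for my real job), using foreachRDD as an "end of batch" hook:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object BatchCountSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("batch-count-sketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Placeholder source; the real job reads from a receiver.
        val events = ssc.socketTextStream("localhost", 9999)

        // foreachRDD runs once per micro-batch, so it can act as an
        // "end of batch" hook; empty batches are skipped here so they
        // do not drag the average down.
        events.foreachRDD { rdd =>
          val count = rdd.count()
          if (count > 0) {
            println(s"Processed $count events in this batch")
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }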

chAlexey
  • do you have some code example of what's not working for you? – maasg May 12 '16 at 19:47
  • It is kind of difficult to explain. We reuse code previously written in Java; it is embedded inside a map function. Our operator, which collects performance data and sends it to our server, is reinitialized at every new batch. It would be good to be able to send the data to our server before our operator is "killed". – chAlexey May 13 '16 at 19:11

1 Answer


Maybe you can use StreamingListener:

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener
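A minimal, untested sketch of such a listener; onBatchCompleted fires once per finished micro-batch and the BatchInfo carries the record count:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class BatchStatsListener extends StreamingListener {
      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        val info = batchCompleted.batchInfo
        if (info.numRecords > 0) {
          // Replace the println with whatever ships the stats to your server.
          println(s"Batch ${info.batchTime} processed ${info.numRecords} records")
        }
      }
    }

    // Register it on the StreamingContext before starting the stream:
    // ssc.addStreamingListener(new BatchStatsListener())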

javierhe
  • This seems like quite a good direction. I will definitely try it out over the weekend. :) – chAlexey May 13 '16 at 19:12
  • It was good advice, BUT: such a listener is initialized by the driver, while my code is executed on an executor, and I need to call my "finish" function on the executor. So I do not receive any updates on events such as batch completion where I need them. Do you know of any possible workaround? – chAlexey May 19 '16 at 12:43
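A possible workaround, sketched here with a hypothetical StatsCollector standing in for the operator from the comments above: scope the collector to a single partition inside foreachRDD/foreachPartition, so its finish() call runs on the executor at the end of every batch rather than on the driver:

    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical stand-in for the operator described in the comments.
    class StatsCollector extends Serializable {
      private var count = 0L
      def add(record: String): Unit = count += 1
      def finish(): Unit = {
        // The real operator would send the accumulated stats to the server here.
        println(s"Flushing stats for $count records")
      }
    }

    object StatsWorkaround {
      def attach(events: DStream[String]): Unit =
        events.foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            // The collector lives only for this partition of this batch, so
            // finish() runs on the executor before the object is discarded.
            val collector = new StatsCollector()
            records.foreach(collector.add)
            collector.finish()
          }
        }
    }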