spark.streams.addListener(new StreamingQueryListener() {
    // ...
    override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
        println("Query made progress: " + queryProgress.progress)
    }
    // ...
})

When a StreamingQueryListener is added to a Spark Structured Streaming session and the query progress is printed continuously, one of the metrics you get is durationMs:

Query made progress: {
  ......
  "durationMs" : {
    "addBatch" : 159136,
    "getBatch" : 0,
    "getEndOffset" : 0,
    "queryPlanning" : 38,
    "setOffsetRange" : 14,
    "triggerExecution" : 159518,
    "walCommit" : 182
  }
  ......
}

Can anyone tell me what those sub-metrics in durationMs mean in the Spark context? For example, what is the meaning of `"addBatch" : 159136`?

thebluephantom
Machi

1 Answer


https://www.waitingforcode.com/apache-spark-structured-streaming/query-metrics-apache-spark-structured-streaming/read

This is an excellent site that addresses these aspects and more, so the credit goes there.
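In short (this is my own summary of the article's glosses, so treat it as a sketch rather than an authoritative definition): `triggerExecution` is the total time for one trigger, and `addBatch` (actually executing the micro-batch and writing it to the sink) usually dominates it. Plugging in the numbers from the question:

```scala
// Sub-metric values copied from the durationMs output in the question.
// Rough glosses (paraphrasing the linked article):
//   setOffsetRange / getEndOffset - resolving the source's available offsets
//   getBatch         - building the DataFrame for the offset range (plan only, hence ~0 ms)
//   queryPlanning    - generating the physical execution plan
//   addBatch         - running the micro-batch and writing results to the sink
//   walCommit        - committing the processed offsets to the write-ahead log
//   triggerExecution - the whole trigger, roughly the sum of the above
val durationMs = Map(
  "addBatch"         -> 159136L,
  "getBatch"         -> 0L,
  "getEndOffset"     -> 0L,
  "queryPlanning"    -> 38L,
  "setOffsetRange"   -> 14L,
  "triggerExecution" -> 159518L,
  "walCommit"        -> 182L
)

// addBatch dominates; everything else is a few hundred ms of overhead.
val overheadMs = durationMs("triggerExecution") - durationMs("addBatch")
println(s"Non-addBatch overhead: $overheadMs ms") // 382 ms
```

So an `addBatch` of 159136 means that trigger spent about 159 seconds processing and writing the micro-batch, out of a 159518 ms trigger overall.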

thebluephantom
  • That helped a lot and solved all my issues. Thanks very much. – Machi Apr 08 '20 at 06:34
  • I have the same confusion even after going through the link you mentioned. I am reading data from Kafka and inside the `foreachBatch` loop I just print the count of the micro-batch. What I observe is that for all micro-batches, getBatch is either 0 or 1, whereas addBatch is in the thousands (11516, 9244, 8626, etc.). So my question is why getBatch is just 0/1 ms, as it should take some time to get data from Kafka, which is not running on my local machine. I wonder whether getBatch is correct, or whether its definition in the linked article is correct. – conetfun May 31 '20 at 22:15
  • One must post a new question for that. – thebluephantom Jun 01 '20 at 06:26