Spark Structured Streaming StreamingQueryListener.onQueryProgress not called per microbatch?

Question

I'm using Spark 3.0.2 and I have a streaming job that consumes data from Kafka with a trigger interval of "1 minute".

In the Spark UI I see a new job every minute, as configured, but onQueryProgress is only being called every 5~6 minutes. I thought this method should be called directly after each micro-batch.

Is there a way to control this interval and make it equal to the trigger interval?

Is the structured streaming query processing data in every micro-batch? Or could it be that some of the micro-batches contain no data, so onQueryProgress does not get triggered? – Michael Heil, Apr 19 '21 at 14:00
I can see in the Spark UI that it processes data every minute, as there is input data. – Mahmoud Hanafy, Apr 19 '21 at 14:43

score 1 · Answer 1 · answered Apr 19 '21 at 20:07

The onQueryProgress method of the StreamingQueryListener is called asynchronously after the data has been completely processed within each micro-batch.

You are seeing this listener being triggered only every 5~6 minutes because that is how long the streaming job takes to process all the data fetched in a micro-batch. Setting the trigger interval to 1 minute tells Spark to plan micro-batches accordingly, but it does not mean that the job is able to process all available data within that 1-minute time frame.

To reduce the amount of data being fetched by your query from Kafka you can play around with the source option maxOffsetsPerTrigger.
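
As a rough illustration of that option, here is a minimal sketch of a Kafka source with maxOffsetsPerTrigger set. The object name, broker address, topic name and the limit of 100000 are placeholders chosen for the example, not values from the question, and the Kafka connector (spark-sql-kafka-0-10) is assumed to be on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object; the names and values below are illustrative only.
object KafkaSource {
  def read(spark: SparkSession): DataFrame =
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "events")                       // assumed topic name
      // Upper bound on the number of offsets read per micro-batch,
      // split across the topic partitions.
      .option("maxOffsetsPerTrigger", "100000")            // example value, tune to your throughput
      .load()
}
```

A lower limit gives smaller micro-batches that finish sooner, at the cost of taking more batches to catch up on a backlog.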

By the way, if the query is not processing any data, this method is still called every 10 seconds by default (the interval is set by spark.sql.streaming.noDataProgressEventInterval). If you want to ignore those calls, you can guard your logic with if (event.progress.numInputRows > 0).
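
A listener with that guard could look roughly like the sketch below. The class name NonEmptyBatchListener and the println reporting are made up for the example; only the StreamingQueryListener callbacks and the progress fields come from Spark.

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{
  QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent
}

// Hypothetical listener that only reacts to micro-batches that read rows.
class NonEmptyBatchListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val progress = event.progress
    // Skip the periodic progress events that are emitted while no data arrives.
    if (progress.numInputRows > 0) {
      println(s"batch=${progress.batchId} rows=${progress.numInputRows}")
    }
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}
```

Assuming an existing SparkSession named spark, it would be registered with spark.streams.addListener(new NonEmptyBatchListener).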

score 0 · Answer 2 · answered Apr 21 '21 at 15:48

I found the cause in my case: the onQueryProgress method itself was taking 5 minutes to complete.

As Mike mentioned, onQueryProgress is called asynchronously, but I think Spark uses the same thread for every call to this method, so it waits for one call to finish before making the next one.

So the solution in my case was to figure out why it was taking that long and to make it faster than the trigger duration.
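
If the slow part cannot easily be made faster, one alternative (not what I did, just a sketch) is to hand it off to a separate thread so that onQueryProgress returns immediately. In the example below, OffloadingListener and slowWork are hypothetical names; slowWork stands in for whatever made the callback slow.

```scala
import java.util.concurrent.{ExecutorService, Executors}

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{
  QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent
}

// Hypothetical listener: `slowWork` is a placeholder for the expensive step,
// for example a blocking call to an external system.
class OffloadingListener(slowWork: String => Unit) extends StreamingQueryListener {
  // A single worker thread keeps progress reports in arrival order.
  private val worker: ExecutorService = Executors.newSingleThreadExecutor()

  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // Copy what is needed, then return right away so that the thread
    // delivering listener events is not blocked.
    val json = event.progress.json
    worker.execute(new Runnable {
      override def run(): Unit = slowWork(json)
    })
  }

  // Simplification: assumes this listener serves a single query.
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = worker.shutdown()
}
```

Note that if slowWork stays slower than the trigger interval, the reports will pile up in the executor's queue, so this only hides the delay from Spark rather than removing it.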

Mahmoud Hanafy


Would it make sense to accept my answer, or how does it differ from yours? – Michael Heil Apr 21 '21 at 16:09
My answer is quite different from yours. The cause of my issue was that the onQueryProgress method itself was taking 5 minutes to complete, while Spark was processing each micro-batch in 1 minute. – Mahmoud Hanafy Apr 29 '21 at 12:40
