When a query in Spark Structured Streaming is started without any trigger setting, e.g.:
import org.apache.spark.sql.streaming.Trigger

// Default trigger (runs a micro-batch as soon as it can)
df.writeStream
  .format("console")
  // .trigger(???)  // <-- trigger intentionally omitted
  .start()
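For contrast, an explicit trigger would fill in the ??? above, e.g. (the one-minute interval here is just an arbitrary example, not something I actually use):

// Start a micro-batch on a fixed schedule instead of "as soon as possible"
df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("1 minute"))  // example interval
  .start()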
As of Spark 2.4.3 (August 2019), the Structured Streaming Programming Guide, in its Triggers section, says:
If no trigger setting is explicitly specified, then by default, the query will be executed in micro-batch mode, where micro-batches will be generated as soon as the previous micro-batch has completed processing.
QUESTION: On what basis does the default trigger determine the size of the micro-batches?
For example, say the input source is Kafka and the job was interrupted for a day because of an outage. When the same Spark job is restarted, it will consume messages from where it left off. Does that mean the first micro-batch will be a gigantic batch containing one day's worth of messages that accumulated in the Kafka topic while the job was stopped? Assuming that big batch takes 10 hours to process, does the next micro-batch then contain 10 hours' worth of messages, and so on for X iterations, gradually working off the backlog until the micro-batches shrink back to a normal size?
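For reference, here is a minimal sketch of the scenario I have in mind (the broker address, topic name, checkpoint path, and cap value are all made up). My understanding is that the Kafka source option maxOffsetsPerTrigger is the knob that limits how many records a single micro-batch may pull, which seems directly relevant to whether that first post-outage batch becomes gigantic:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-backlog-sketch")  // hypothetical app name
  .getOrCreate()

// The checkpoint under checkpointLocation is what lets a restarted query
// resume from the offsets it had committed before the outage.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // made-up broker
  .option("subscribe", "events")                     // made-up topic
  // Without this option the source plans one micro-batch over all offsets
  // available at planning time; with it, each micro-batch is capped.
  .option("maxOffsetsPerTrigger", "100000")          // example cap
  .load()

df.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/kafka-backlog")  // made-up path
  .start()
  .awaitTermination()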