Here is how we "solved" this. This is basically the approach mike wrote about in the accepted answer.
In our case, message sizes varied very little, so we knew roughly how long processing one batch takes. In a nutshell, we:
- replaced Trigger.Once() with Trigger.ProcessingTime(<ms>), since maxOffsetsPerTrigger works with this mode
- cut the running query off after a timeout with awaitTermination(<ms>), to mimic Trigger.Once()
- set the processing interval to be larger than the termination interval, so that exactly one "batch" fits into a run
    import org.apache.spark.sql.streaming.Trigger

    val kafkaOptions = Map[String, String](
      "kafka.bootstrap.servers" -> "localhost:9092",
      "failOnDataLoss" -> "false",
      "subscribePattern" -> "testTopic",
      "startingOffsets" -> "earliest",
      "maxOffsetsPerTrigger" -> "10" // "batch" size
    )
    val streamWriterOptions = Map[String, String](
      "checkpointLocation" -> "path/to/checkpoints"
    )
    val processingInterval = 30000L  // ms between triggers
    val terminationInterval = 15000L // ms to wait before the run ends

    sparkSession
      .readStream
      .format("kafka")
      .options(kafkaOptions)
      .load()
      .writeStream
      .options(streamWriterOptions)
      .format("console")
      .trigger(Trigger.ProcessingTime(processingInterval))
      .start()
      .awaitTermination(terminationInterval) // returns once the timeout elapses
This works because the first batch is read and processed within the maxOffsetsPerTrigger limit, say in 10 seconds. The second batch then starts, but it is terminated mid-operation after ~5 s and never reaches the 30 s mark. Its offsets are stored correctly, though, so the next run picks up and processes this "killed" batch.
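One detail worth spelling out: awaitTermination(<ms>) on its own only waits and returns whether the query has terminated within the timeout, so in the snippet above the query presumably ends because the application exits afterwards. If you would rather stop it explicitly, a minimal sketch could look like the following. The helper stopAfterTimeout is hypothetical (it is not part of Spark); its two function parameters stand in for StreamingQuery.awaitTermination(timeoutMs) and StreamingQuery.stop() so the logic can be tried without a cluster.

```scala
object QueryTimeout {
  // Hypothetical helper, not a Spark API.
  // Waits up to timeoutMs for the query to finish; if it is still running
  // after the timeout, stops it explicitly.
  // Returns true if the query had already terminated by itself.
  def stopAfterTimeout(
      awaitTermination: Long => Boolean, // stand-in for query.awaitTermination(timeoutMs)
      stop: () => Unit,                  // stand-in for query.stop()
      timeoutMs: Long
  ): Boolean = {
    val finishedOnItsOwn = awaitTermination(timeoutMs)
    if (!finishedOnItsOwn) stop()
    finishedOnItsOwn
  }
}

// With a real query this would be wired up roughly as:
//   val query = df.writeStream /* ...options, format, trigger... */ .start()
//   QueryTimeout.stopAfterTimeout(
//     ms => query.awaitTermination(ms),
//     () => query.stop(),
//     terminationInterval
//   )
```

Whether the explicit stop() changes anything for checkpointing in your setup is something to verify yourself; the sketch only makes the shutdown step visible.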
A downside of this approach is that you have to know approximately how long it takes to process one "batch": if you set terminationInterval too low, the job will constantly produce no output.
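To guard against a bad choice of intervals, you could check them before starting the query. The following is only a sketch: IntervalCheck and estimatedBatchMs are names made up for illustration, and the estimate itself has to come from your own measurements.

```scala
object IntervalCheck {
  // Hypothetical sanity check for the "exactly one batch per run" setup:
  //   estimatedBatchMs < terminationMs < processingMs
  // Returns Right(()) if the ordering holds, otherwise Left with the reason.
  def check(
      estimatedBatchMs: Long, // your own estimate of how long one batch takes
      terminationMs: Long,    // timeout passed to awaitTermination
      processingMs: Long      // interval passed to Trigger.ProcessingTime
  ): Either[String, Unit] =
    if (terminationMs <= estimatedBatchMs)
      Left("terminationInterval is too low: the first batch would be cut off")
    else if (processingMs <= terminationMs)
      Left("processingInterval is not larger than terminationInterval: more than one batch may run")
    else
      Right(())
}
```

With the values from the answer (about 10 s per batch, 15 s termination, 30 s processing), IntervalCheck.check(10000L, 15000L, 30000L) passes.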
Of course, if you don't care about the exact number of batches you process in one run, you can easily set processingInterval to be several times smaller than terminationInterval. In that case, you may process a varying number of batches in one go, while still respecting the value of maxOffsetsPerTrigger.
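For a back-of-the-envelope feel of how much one such run consumes, the arithmetic can be sketched as below. BatchEstimate and its functions are hypothetical names, and the result is only a rough figure: it assumes every batch finishes within one processing interval, and the real count can be off by a batch or so depending on when triggers fire and whether the last batch is cut off.

```scala
object BatchEstimate {
  // Rough number of batches that fit into one run. At least one batch is
  // counted because the first batch starts as soon as the query starts.
  def roughBatchCount(terminationMs: Long, processingMs: Long): Long = {
    require(processingMs > 0, "processingMs must be positive")
    math.max(1L, terminationMs / processingMs)
  }

  // Rough number of offsets consumed per run:
  // estimated batches times maxOffsetsPerTrigger.
  def roughOffsetsPerRun(
      terminationMs: Long,
      processingMs: Long,
      maxOffsetsPerTrigger: Long
  ): Long =
    roughBatchCount(terminationMs, processingMs) * maxOffsetsPerTrigger
}
```

For example, with a 15 s termination interval, a 5 s processing interval and maxOffsetsPerTrigger set to 10, the estimate is about 3 batches and about 30 offsets per run.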