I have a Spark Structured Streaming application that reads data from a single Kafka topic, validates it, and writes the results to multiple Delta tables. After releasing a new application version and redeploying it, the first trigger processed far more data than the configured maxOffsetsPerTrigger option allows.
When I stopped the Spark app, the topic had a backlog of 139104 messages. After the application was redeployed, the first trigger processed all of these messages at once.
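Spark records the end offsets it plans for each micro-batch in the checkpoint's offsets log, so the last planned batch before the restart can be inspected there. A minimal sketch, assuming the checkpoint sits on a locally readable path such as /checkpoints/myApp (a placeholder, not my actual location):

import scala.io.Source
import java.io.File

object InspectCheckpointOffsets {
  def main(args: Array[String]): Unit = {
    // Hypothetical checkpoint location; the real path depends on the deployment.
    val offsetsDir = new File("/checkpoints/myApp/offsets")

    // Each numeric file in offsets/ corresponds to one micro-batch; its last line
    // is a JSON map of topic -> partition -> end offset planned for that batch.
    val batchFiles = offsetsDir.listFiles().filter(_.getName.forall(_.isDigit))
    val latest = batchFiles.maxBy(_.getName.toLong)

    val source = Source.fromFile(latest)
    try {
      println(s"Last planned batch: ${latest.getName}")
      println(source.getLines().toList.last) // e.g. {"myTopic":{"0":139104}}
    } finally source.close()
  }
}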
The stream is read with:
val options = Map(
  "failOnDataLoss" -> "false",
  "kafka.bootstrap.servers" -> "kafka.kafka:9092",
  "subscribe" -> "myTopic",
  "groupIdPrefix" -> "myConsumerGroup",
  "maxOffsetsPerTrigger" -> "5000" // option values must be strings
)

spark
  .readStream
  .format("org.apache.spark.sql.kafka010.KafkaSourceProvider")
  .options(options)
  .load()
  .select("value")
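The write side is omitted above for brevity; a minimal sketch of how the query fans out to multiple Delta tables, with placeholder paths, trigger interval, and validation rules standing in for the real ones:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// Same read as above, assigned to a val so the write side can refer to it.
val kafkaStream: DataFrame = spark
  .readStream
  .format("kafka")
  .options(options)
  .load()
  .select("value")

val query = kafkaStream.writeStream
  .trigger(Trigger.ProcessingTime("1 minute"))         // placeholder trigger interval
  .option("checkpointLocation", "/checkpoints/myApp")  // placeholder path
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Hypothetical validation split; the real rules live in the application.
    val valid   = batch.filter("value IS NOT NULL")
    val invalid = batch.filter("value IS NULL")

    valid.write.format("delta").mode("append").save("/delta/validRecords")     // placeholder table
    invalid.write.format("delta").mode("append").save("/delta/invalidRecords") // placeholder table
  }
  .start()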
Meanwhile the topic still has more messages to process, but from the second trigger onward the maxOffsetsPerTrigger option works as expected.
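The per-trigger input size can be confirmed from the query progress; a minimal sketch of a listener that logs numInputRows for each micro-batch (this listener is illustrative, not part of the original application):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Logs how many rows each micro-batch actually read, so the first trigger's
// size can be compared against the configured maxOffsetsPerTrigger.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    println(s"batch=${p.batchId} numInputRows=${p.numInputRows}")
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})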
Environment versions:
Spark: 3.1.2
Delta Lake: 1.0.0
Kafka: 2.8.1