I have a Spark Structured Streaming application that reads data from a single Kafka topic, validates it, and writes the results to multiple Delta tables. After releasing a new application version and redeploying it, the first trigger processed far more data than the configured maxOffsetsPerTrigger option allows.
When I stopped the Spark app, the topic had a backlog of 139104 messages. After the application was redeployed, the first trigger processed all of these messages at once.
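Spark records the end offsets it plans for each micro-batch in the checkpoint's offsets log, so the last planned batch before the restart can be inspected there. A minimal sketch, assuming the checkpoint sits on a locally readable path such as /checkpoints/myApp (a placeholder, not my actual location):

import scala.io.Source
import java.io.File

object InspectCheckpointOffsets {
  def main(args: Array[String]): Unit = {
    // Hypothetical checkpoint location; the real path depends on the deployment.
    val offsetsDir = new File("/checkpoints/myApp/offsets")

    // Each numeric file in offsets/ corresponds to one micro-batch; its last line
    // is a JSON map of topic -> partition -> end offset planned for that batch.
    val batchFiles = offsetsDir.listFiles().filter(_.getName.forall(_.isDigit))
    val latest = batchFiles.maxBy(_.getName.toLong)

    val source = Source.fromFile(latest)
    try {
      println(s"Last planned batch: ${latest.getName}")
      println(source.getLines().toList.last) // e.g. {"myTopic":{"0":139104}}
    } finally source.close()
  }
}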
The stream is read with:
val options = Map(
  "failOnDataLoss" -> "false",
  "kafka.bootstrap.servers" -> "kafka.kafka:9092",
  "subscribe" -> "myTopic",
  "groupIdPrefix" -> "myConsumerGroup",
  "maxOffsetsPerTrigger" -> "5000" // option values must be strings
)

spark
  .readStream
  .format("org.apache.spark.sql.kafka010.KafkaSourceProvider")
  .options(options)
  .load()
  .select("value")
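The write side is omitted above for brevity; a minimal sketch of how the query fans out to multiple Delta tables, with placeholder paths, trigger interval, and validation rules standing in for the real ones:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// Same read as above, assigned to a val so the write side can refer to it.
val kafkaStream: DataFrame = spark
  .readStream
  .format("kafka")
  .options(options)
  .load()
  .select("value")

val query = kafkaStream.writeStream
  .trigger(Trigger.ProcessingTime("1 minute"))         // placeholder trigger interval
  .option("checkpointLocation", "/checkpoints/myApp")  // placeholder path
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Hypothetical validation split; the real rules live in the application.
    val valid   = batch.filter("value IS NOT NULL")
    val invalid = batch.filter("value IS NULL")

    valid.write.format("delta").mode("append").save("/delta/validRecords")     // placeholder table
    invalid.write.format("delta").mode("append").save("/delta/invalidRecords") // placeholder table
  }
  .start()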
Meanwhile the topic still has more messages to process, but from the second trigger onward the maxOffsetsPerTrigger option works as expected.
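The per-trigger input size can be confirmed from the query progress; a minimal sketch of a listener that logs numInputRows for each micro-batch (this listener is illustrative, not part of the original application):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Logs how many rows each micro-batch actually read, so the first trigger's
// size can be compared against the configured maxOffsetsPerTrigger.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    println(s"batch=${p.batchId} numInputRows=${p.numInputRows}")
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})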
Environment versions:
Spark: 3.1.2
Delta Lake: 1.0.0
Kafka: 2.8.1