
I have a Spark Structured Streaming application that reads data from one Kafka topic, performs data validation, and writes to multiple Delta tables. After releasing a new application version and redeploying it, the first trigger processed far more data than the maxOffsetsPerTrigger option allows.

  1. When I stopped the Spark app, the topic offset was at 139104 messages (screenshot).

  2. After redeploying the application, the first trigger processed all of these messages (screenshot).

  3. The stream is read by:

val options = Map(
  "failOnDataLoss" -> "false",
  "kafka.bootstrap.servers" -> "kafka.kafka:9092",
  "subscribe" -> "myTopic",
  "groupIdPrefix" -> "myConsumerGroup",
  "maxOffsetsPerTrigger" -> "5000" // as a String, so the Map stays Map[String, String] for .options()
)

spark
  .readStream
  .format("org.apache.spark.sql.kafka010.KafkaSourceProvider")
  .options(options)
  .load()
  .select("value")
  4. The next trigger processed 4998 records (screenshot).

  5. Meanwhile the topic had more messages to process, but from the second trigger onward the maxOffsetsPerTrigger option worked as expected (screenshot).
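
The write side is not shown above. For context, here is a minimal sketch of how the stream might be fanned out to the Delta tables with foreachBatch; the table paths, checkpoint location, trigger interval, and validation step are placeholders, not the actual application code. The checkpointLocation is where the Kafka source records the offsets it resumes from after a restart.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

spark
  .readStream
  .format("org.apache.spark.sql.kafka010.KafkaSourceProvider")
  .options(options)
  .load()
  .select("value")
  .writeStream
  // offsets committed here determine where the next run resumes
  .option("checkpointLocation", "/checkpoints/myApp")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // hypothetical validation step; the real logic is not shown in the question
    val validated = batch
    // write the same micro-batch to several Delta tables
    validated.write.format("delta").mode("append").save("/delta/tableA")
    validated.write.format("delta").mode("append").save("/delta/tableB")
  }
  .start()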

Environment versions:

Spark: 3.1.2
Delta Lake: 1.0.0
Kafka: 2.8.1
