
I've set up a Spark structured streaming query that reads from a Kafka topic. If the number of partitions in the topic is changed while the Spark query is running, Spark does not seem to notice and data on new partitions is not consumed.

Is there a way to tell Spark to check for new partitions in the same topic apart from stopping the query and restarting it?

EDIT: I'm using Spark 2.4.4. I read from Kafka as follows:

    // Streaming read from the Kafka topic
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaURL)
      .option("startingOffsets", "earliest")
      .option("subscribe", topic)
      .option("failOnDataLoss", value = false)
      .load()

After some processing, I write to a Delta Lake table on HDFS.
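For illustration, the write side looks roughly like this (a minimal sketch: the checkpoint and table paths are placeholders, and the actual intermediate processing is omitted):

    // Minimal sketch of the Delta Lake sink; paths are placeholders
    df
      .writeStream
      .format("delta")
      .option("checkpointLocation", "hdfs:///checkpoints/my_table") // placeholder path
      .outputMode("append")
      .start("hdfs:///delta/my_table")                              // placeholder table path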

redsk
  • What's the version of Spark? Can you show the code that you use to access Kafka? What you describe should be supported out of the box. – Jacek Laskowski Nov 08 '19 at 22:23

1 Answer


Answering my own question. Kafka consumers check for new partitions (and, when subscribing to topics with a pattern, for new topics) every metadata.max.age.ms, which defaults to 300000 ms (5 minutes).

Since my tests lasted far less than that, I never saw the update. For testing, reduce the value to something smaller, e.g. 100 ms, by setting the following option on the DataStreamReader:

.option("kafka.metadata.max.age.ms", 100)
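In context, the reader from the question would look like this (only the extra option is new; everything else is unchanged):

    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaURL)
      .option("startingOffsets", "earliest")
      .option("subscribe", topic)
      .option("failOnDataLoss", value = false)
      // Poll Kafka for metadata changes (e.g. new partitions) every 100 ms
      // instead of the 5-minute default; useful for tests, too aggressive for production
      .option("kafka.metadata.max.age.ms", 100)
      .load()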
redsk