I am using Spark 2.4.0 Structured Streaming in batch mode (i.e. spark.read rather than spark.readStream) to consume a Kafka topic. I checkpoint the read offsets myself and pass them back via .option("startingOffsets", ...)
to dictate where the next job run should continue reading.
The docs say: "Newly discovered partitions during a query will start at earliest."
However, testing showed that when a new partition is added and I reuse the last checkpointed offsets, I get the following error:
Caused by: java.lang.AssertionError: assertion failed: If startingOffsets contains specific offsets, you must specify all TopicPartitions.
How can I check programmatically whether any new partitions were created, so that I can update my startingOffsets parameter before starting the read?
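One way to sidestep the assertion is to rebuild the startingOffsets JSON from the checkpoint plus the topic's current partition list (which you could fetch, for example, with kafka-python's KafkaConsumer.partitions_for_topic, or the Kafka AdminClient). Below is a minimal, hypothetical sketch of just the merge step: any partition missing from the checkpoint is assigned -2, which Spark treats as "earliest". The function name and data shapes are illustrative assumptions, not part of Spark's API.

```python
import json

def merge_starting_offsets(topic, checkpointed, current_partitions):
    """Build a startingOffsets JSON string covering every partition.

    topic:              topic name (str)
    checkpointed:       dict {partition id (int): last committed offset (int)}
                        recovered from the previous run's checkpoint
    current_partitions: iterable of all partition ids currently on the topic

    Partitions present on the topic but absent from the checkpoint are new;
    they get -2, which Spark's Kafka source interprets as "earliest".
    """
    offsets = {str(p): checkpointed.get(p, -2) for p in current_partitions}
    return json.dumps({topic: offsets})

# Example: the checkpoint knows partitions 0 and 1, but partition 2
# was added to the topic since the last run.
print(merge_starting_offsets("events", {0: 100, 1: 250}, [0, 1, 2]))
# {"events": {"0": 100, "1": 250, "2": -2}}
```

The resulting string can be passed directly to .option("startingOffsets", ...), which satisfies the requirement that specific offsets must cover all TopicPartitions.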