According to the Kafka documentation, offsets in Kafka can be managed with enable.auto.commit and auto.commit.interval.ms. I have difficulty understanding the concept.
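For reference, here is roughly how I understand those two settings would be used in a plain (non-Spark) consumer. This is an untested sketch assuming the confluent_kafka Python client; broker address, group id, and topic name are placeholders.

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker address
    "group.id": "daily-batch-loader",        # hypothetical consumer group
    "enable.auto.commit": True,              # commit offsets automatically in the background
    "auto.commit.interval.ms": 5000,         # roughly every 5 seconds while poll() is called
    "auto.offset.reset": "earliest",         # where to start when no committed offset exists
})
consumer.subscribe(["my-topic"])
msg = consumer.poll(timeout=1.0)             # the consumer's position advances as messages are polled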

For example, I have a Kafka topic that shall be batch-loaded every day and shall only load the new entries since the last loading process. How do I configure both parameters, and what are the pros and cons of automatic offset management?

This site: https://www.learningjournal.guru/courses/kafka/kafka-foundation-training/offset-management/ states the interval is normally set to 5 seconds. Does this mean the latest offset is updated 5 seconds after Kafka ran, or is the stored offset generally updated every 5 seconds regardless of the last run? If Kafka stores the offset itself, how does the retrieval process work?

There is also the starting offset parameter startingOffsets. How can I retrieve the last auto-committed offset? Currently I understand you can only set it to earliest for batch, or enter offsets manually.
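As far as I can tell, the committed offsets of a consumer group can be read back with a plain Kafka client and turned into the JSON string that startingOffsets expects. Again an untested sketch assuming confluent_kafka; note that partitions without a committed offset come back with a negative sentinel offset and would need special handling.

import json
from confluent_kafka import Consumer, TopicPartition

topic = "my-topic"                                      # placeholder topic name
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",              # placeholder broker address
    "group.id": "daily-batch-loader",                   # the group whose offsets we want
})
metadata = consumer.list_topics(topic)                  # discover the topic's partitions
partitions = [TopicPartition(topic, p) for p in metadata.topics[topic].partitions]
committed = consumer.committed(partitions, timeout=10)  # fetch the committed offsets

# produces e.g. '{"my-topic": {"0": 123, "1": 456}}' for Spark's startingOffsets
starting_offsets = json.dumps(
    {topic: {str(tp.partition): tp.offset for tp in committed}}
)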

Edit: Added Code

spark.sparkContext.setCheckpointDir("directory")  # RDD checkpoint dir; not the same as a streaming checkpointLocation

df = spark.read.format("kafka") \
    .option("kafka.bootstrap.servers", bootstrap_server) \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", jaas_config) \
    .option("kafka.group.id", parameter_group) \
    .option("startingOffsets", parameter_offset_start) \
    .option("endingOffsets", parameter_offset_end) \
    .option("subscribe", topic) \
    .load()  # one-shot batch read over the given offset range
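For context, the manual workaround mentioned in the comments below looks roughly like this on my side: after the batch read I record the highest offset per partition, so the next run can pass it (plus one) as startingOffsets. Simplified, untested sketch; the storage step is a placeholder.

from pyspark.sql import functions as F

last = (df.groupBy("topic", "partition")
          .agg(F.max("offset").alias("max_offset"))
          .collect())
# offsets already read are inclusive, so the next run starts one past them
next_start = {topic: {str(r["partition"]): r["max_offset"] + 1 for r in last}}
# persist json.dumps(next_start) somewhere durable (table, file, ...) for the next run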
  • The linked blog is about the native Kafka client, not about how Spark manages offsets, which includes checkpoint files. You cannot set the auto-commit setting in Structured Streaming – OneCricketeer Jul 25 '23 at 13:04
  • How do checkpoint files work? – AzUser1 Jul 25 '23 at 14:14
  • It maintains state when the app is running, and can resume when stopped/restarted https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing Note that `startingOffsets` is **optional** and is documented saying _resuming will **always pick up from where the query left off**_ – OneCricketeer Jul 25 '23 at 14:38
  • Thanks for the documentation. I have tried setting a checkpoint using `spark.sparkContext.setCheckpointDir()`. Then I'm doing `df = spark.read.format().option...`, but it still gets all data instead of just the newest. It seems not to work, or what am I doing wrong? – AzUser1 Jul 25 '23 at 19:17
  • Can you clarify what you mean by "newest"? When you start a Kafka consumer, it can give the earliest available offset/data when you start the app, but when you run it again, it'll maintain the position, and not have duplicates. Are you setting `kafka.group.id` also? Please [edit] your post with your full code – OneCricketeer Jul 25 '23 at 21:47
  • I have added the code. Since the starting offset is optional, I have taken it out for testing, but I still get all the entries from the topic instead of just the ones since the last run. To clarify "the newest": I want to run the notebook each day as a batch. It shall remember the offsets it already loaded, and on each execution it shall only load the latest entries that haven't been loaded yet, to prevent duplicates. It works if I manually save the last offset in a database and read it before loading, but I would like to do it automatically with Kafka. – AzUser1 Jul 26 '23 at 07:14
  • this is what I'm doing: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries – AzUser1 Jul 26 '23 at 07:18
  • Can you use the `kafka-consumer-groups` command to describe your `parameter_group` value before, after, and in between a few batch runs? Are any offsets stored in Kafka? If they're not, you might want to check that you're using a Spark version that supports that option. Also, you should be using `option("checkpointLocation", ...)` on whatever output stream you're writing to – OneCricketeer Jul 26 '23 at 14:34
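
Building on the checkpointing suggestions in the comments above, the checkpoint-based variant would presumably look roughly like this: run the query as a stream with an available-now trigger so it behaves like a daily batch, and let the checkpoint directory remember the offsets between runs. Untested sketch; the paths are placeholders, and `availableNow` requires Spark 3.3+.

df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", bootstrap_server) \
    .option("subscribe", topic) \
    .load()

query = df.writeStream \
    .format("parquet") \
    .option("path", "/data/kafka_dump") \
    .option("checkpointLocation", "/checkpoints/daily-batch") \
    .trigger(availableNow=True) \
    .start()   # processes everything new since the last run, then stops
query.awaitTermination()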
