according to the Kafka documentation offset in Kafka can be managed using enable.offset.commit
and auto.commit.interval.ms
. I have difficulties understanding the concept.
For example I have a Kafka that shall batch load everyday and only shall load the new entries since the last loading process. How do I configure both parameters and what are the pros and cons of auto offset management?
This site: https://www.learningjournal.guru/courses/kafka/kafka-foundation-training/offset-management/ states the interval is normally set to 5 seconds. Does this mean the latest offset is updated 5 seconds after Kafka ran or genereally after 5 seconds the stored offset is updated regardless of the last run? If Kafka stores the offset itself how does the retrieving process work?
There is the starting offset parameter startingOffsets
. How can I retrieve the last auto commited offset. Currently I understand you can set it only to earliest for batch or a manually input.
Edit: Added Code
spark.sparkContext.setCheckpointDir("directory")
df = spark.read.format("kafka") \
.option("kafka.bootstrap.servers", bootstrap_server) \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.jaas.config", jaas_config) \
.option("kafka.group.id", parameter_group)\
.option("startingOffsets", parameter_offset_start) \
.option("endingOffsets", parameter_offset_end) \
.option("subscribe", topic) \
.load()