
As far as I understand, for a Spark streaming application (Structured Streaming or otherwise), Spark provides checkpointing so that you don't have to manage offsets manually: you just configure the checkpoint location (HDFS most of the time) while writing the data to your sink, and Spark itself takes care of managing the offsets.
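
For reference, here is a minimal sketch of what I mean for Structured Streaming (the broker address, topic name, and paths below are just made-up placeholders):

```scala
// Minimal sketch only: broker, topic, and paths are placeholders, not real config.
import org.apache.spark.sql.SparkSession

object CheckpointedKafkaStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("checkpoint-example").getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "input-topic")
      .load()

    // Spark records the consumed Kafka offsets inside the checkpoint directory,
    // so recovery after a restart needs no manual offset handling.
    df.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/output")
      .option("checkpointLocation", "hdfs:///checkpoints/my-app")
      .start()
      .awaitTermination()
  }
}
```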

But I see a lot of use cases where checkpointing is not preferred and instead an offset-management framework is created to save offsets in HBase or MongoDB etc. I just wanted to understand why checkpointing is not preferred and why a custom framework is created to manage the offsets instead. Is it because it leads to the small-files problem in HDFS?

https://blog.cloudera.com/offset-management-for-apache-kafka-with-apache-spark-streaming/

Gaurav Gupta

1 Answer


Small files are just one problem with HDFS. Out of your listed options, ZooKeeper would be more recommended, since you'd likely already have a ZooKeeper cluster (or several) as part of the Kafka and Hadoop ecosystem.

The reason checkpoints aren't used is that they are highly coupled to the code's topology. For example, if you run map, filter, reduce or other Spark functions, then the exact order of those operations matters and is baked into the checkpoint, so you cannot recover from a checkpoint after the code has changed.

Storing the offsets externally keeps a consistent ordering across code changes, but comes with different delivery semantics.
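
A rough sketch of that external-store pattern is below. The `loadOffsetsFromStore` / `saveOffsetsToStore` helpers are hypothetical placeholders for whatever store you choose (ZooKeeper, HBase, an RDBMS, ...); this is not a drop-in implementation.

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object ExternalOffsetStoreExample {
  // Placeholder: read the last committed offset for each partition from your store.
  def loadOffsetsFromStore(): Map[TopicPartition, Long] =
    Map(new TopicPartition("input-topic", 0) -> 0L)

  // Placeholder: persist the processed offset ranges, ideally in the same
  // transaction as the batch output so replays don't create duplicates.
  def saveOffsetsToStore(ranges: Array[OffsetRange]): Unit = ()

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("external-offsets"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "my-group",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Start the stream from the offsets recorded in the external store.
    val fromOffsets = loadOffsetsFromStore()
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
    )

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... write the batch to your sink here ...
      saveOffsetsToStore(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The important design point is committing the batch output and the offsets together; if both land in one transaction, replays after a failure don't produce duplicate output.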

You could also just store the offsets in Kafka itself (but disable auto-commit):

https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#storing-offsets
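
Reusing the `stream` from the sketch above, that approach looks roughly like the example in the linked docs:

```scala
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... write the batch to your sink here ...
  // Commit the consumed offsets back to Kafka only after the output succeeded.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```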

OneCricketeer
  • Thanks for the detailed answer, but I still couldn't understand the part "highly coupled to the code's topology" and the thing about different delivery semantics. Can you please elaborate on how that is a problem? – Gaurav Gupta May 16 '22 at 19:24
  • Read the link? _Checkpoints... there are drawbacks. Your output operation must be idempotent, since you will get repeated outputs; transactions are not an option. Furthermore, **you cannot recover from a checkpoint if your application code has changed**_ For delivery semantics - Kafka supports at-least-once delivery by default (you could get duplicates). Using an external store might allow for exactly-once delivery, if you manage the offsets well enough. – OneCricketeer May 16 '22 at 19:27
  • Yes I went through the link and I could understand that it is talking of some possible duplication but couldn't understand the part "your output operations should be idempotent" . Is it suggesting to avoid idempotent operations? What kind of operations are considered idempotent in spark? – Gaurav Gupta May 16 '22 at 19:32
  • I would interpret that section as saying your code should be finalized before considering to use checkpoints. So, you don't want to use checkpoints while developing or if error-handling isn't fully thought through. Those changes would require code changes like pre-validating all incoming data before it continues to be processed. Idempotence depends on the inputs and outputs. If you modify the function of a map/filter/reduce/foreach, etc then those are not idempotent changes between different deployments of the same application. – OneCricketeer May 16 '22 at 20:23