
As far as I understand, for a Spark streaming application (Structured Streaming or otherwise), Spark provides checkpointing so that you don't have to manage offsets manually: you just configure the checkpoint location (HDFS most of the time) while writing the data to your sink, and Spark itself takes care of managing the offsets.
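
For reference, here is a minimal sketch of what I mean for Structured Streaming (the broker address, topic name, and paths below are just made-up placeholders):

```scala
// Minimal sketch only: broker, topic, and paths are placeholders, not real config.
import org.apache.spark.sql.SparkSession

object CheckpointedKafkaStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("checkpoint-example").getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "input-topic")
      .load()

    // Spark records the consumed Kafka offsets inside the checkpoint directory,
    // so recovery after a restart needs no manual offset handling.
    df.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/output")
      .option("checkpointLocation", "hdfs:///checkpoints/my-app")
      .start()
      .awaitTermination()
  }
}
```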

But I see a lot of use cases where checkpointing is not preferred and instead an offset-management framework is created to save offsets in HBase or MongoDB etc. I just wanted to understand why checkpointing is not preferred and why a custom framework is created to manage the offsets instead. Is it because it leads to the small-files problem in HDFS?

https://blog.cloudera.com/offset-management-for-apache-kafka-with-apache-spark-streaming/

Gaurav Gupta

1 Answer


Small files are just one problem with HDFS. Out of your listed options, ZooKeeper would be more recommended, since you'd likely already have a ZooKeeper cluster (or several) as part of the Kafka and Hadoop ecosystem.

The reason checkpoints aren't used is that they are highly coupled to the code's topology. For example, if you run map, filter, reduce or other Spark functions, then the exact order of those operations matters and is baked into the checkpoint, so you cannot recover from a checkpoint after the code has changed.

Storing the offsets externally keeps a consistent ordering across code changes, but comes with different delivery semantics.
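
A rough sketch of that external-store pattern is below. The `loadOffsetsFromStore` / `saveOffsetsToStore` helpers are hypothetical placeholders for whatever store you choose (ZooKeeper, HBase, an RDBMS, ...); this is not a drop-in implementation.

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object ExternalOffsetStoreExample {
  // Placeholder: read the last committed offset for each partition from your store.
  def loadOffsetsFromStore(): Map[TopicPartition, Long] =
    Map(new TopicPartition("input-topic", 0) -> 0L)

  // Placeholder: persist the processed offset ranges, ideally in the same
  // transaction as the batch output so replays don't create duplicates.
  def saveOffsetsToStore(ranges: Array[OffsetRange]): Unit = ()

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("external-offsets"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "my-group",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Start the stream from the offsets recorded in the external store.
    val fromOffsets = loadOffsetsFromStore()
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
    )

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... write the batch to your sink here ...
      saveOffsetsToStore(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The important design point is committing the batch output and the offsets together; if both land in one transaction, replays after a failure don't produce duplicate output.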

You could also just store the offsets in Kafka itself (but disable auto-commit):

https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#storing-offsets
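
Reusing the `stream` from the sketch above, that approach looks roughly like the example in the linked docs:

```scala
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... write the batch to your sink here ...
  // Commit the consumed offsets back to Kafka only after the output succeeded.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```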

OneCricketeer
  • Thanks for the detailed answer, but I still couldn't understand the part "highly coupled to the code's topology" and the thing about different delivery semantics. Can you please elaborate on how that is a problem? – Gaurav Gupta May 16 '22 at 19:24
  • Read the link? _Checkpoints... there are drawbacks. Your output operation must be idempotent, since you will get repeated outputs; transactions are not an option. Furthermore, **you cannot recover from a checkpoint if your application code has changed**_ For delivery semantics - Kafka supports at-least-once delivery by default (you could get duplicates). Using an external store might allow for exactly-once delivery, if you manage the offsets well enough. – OneCricketeer May 16 '22 at 19:27
  • Yes I went through the link and I could understand that it is talking of some possible duplication but couldn't understand the part "your output operations should be idempotent" . Is it suggesting to avoid idempotent operations? What kind of operations are considered idempotent in spark? – Gaurav Gupta May 16 '22 at 19:32
  • I would interpret that section as saying your code should be finalized before considering to use checkpoints. So, you don't want to use checkpoints while developing or if error-handling isn't fully thought through. Those changes would require code changes like pre-validating all incoming data before it continues to be processed. Idempotence depends on the inputs and outputs. If you modify the function of a map/filter/reduce/foreach, etc then those are not idempotent changes between different deployments of the same application. – OneCricketeer May 16 '22 at 20:23