
This article (https://dzone.com/articles/what-are-spark-checkpoints-on-dataframes) says that checkpointing is used to "freeze the content of a dataframe before I do something else".

However, this article (http://blog.madhukaraphatak.com/introduction-to-spark-structured-streaming-part-7/) says that checkpointing is used to recover from failure. From this I gather that if Spark is processing a Kafka topic and crashes, then after it restarts it will resume processing from the offsets it last checkpointed. Is that correct?

Are there two different concepts of checkpointing in Spark? I can't reconcile the two.

Funzo
  • Related: https://stackoverflow.com/questions/39599863/is-checkpointing-necessary-in-spark-streaming – shanmuga Mar 11 '19 at 10:51

1 Answer


The simpler answer: if you are just consuming from Kafka, transforming the data, and loading it into another system, you don't need checkpointing; Kafka offset commits should be enough.
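To illustrate the stateless case, here is a plain-Python sketch (no real Kafka or Spark; the topic list, sink, and offset variable are stand-ins for illustration) of a consume-transform-load loop where committed offsets alone are enough to resume correctly after a restart:

```python
# Plain-Python sketch: a stateless consume-transform-load loop.
# Because no state spans batches, the committed offset is all that
# needs to survive a restart.

topic = [1, 2, 3, 4, 5, 6]           # stand-in for a Kafka partition
sink = []                            # stand-in for the target system

def run_batches(offset, n_batches, batch_size=2):
    """Process n_batches starting at offset; return the new committed offset."""
    for _ in range(n_batches):
        batch = topic[offset:offset + batch_size]
        sink.extend(x * 10 for x in batch)   # stateless transform + load
        offset += len(batch)                 # "commit" the offset after the batch
    return offset

committed_offset = run_batches(0, 2)                  # first run: 2 batches
# --- application restarts; no checkpoint needed ---
committed_offset = run_batches(committed_offset, 1)   # resumes where it left off

print(sink)   # [10, 20, 30, 40, 50, 60] -- nothing lost, nothing duplicated
```

Because each batch is independent, the restarted run picks up at the committed offset and produces exactly the output the uninterrupted run would have.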

However, if you are doing windowing to calculate running aggregates (e.g. a running average over the last 5 hours), then the (previously consumed) data for the time window (the last 5 hours in this case) is stored in the checkpoint. This is what is meant by

freeze the content of a dataframe before I do something else

In the absence of checkpointing, when the Spark application is restarted the running aggregates will be reset, since only data received after the last committed offset will be consumed from Kafka.
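To see why the windowed state must be checkpointed, here is a plain-Python sketch (no Spark; the `RunningAvg` class and its `checkpoint` method are made up for illustration) of a running aggregate that survives a restart only when its state was persisted:

```python
# Plain-Python illustration: a running average whose state must be
# persisted ("checkpointed") to survive a restart.

class RunningAvg:
    def __init__(self, state=None):
        # state is (sum, count); a fresh instance starts empty
        self.total, self.count = state if state else (0.0, 0)

    def update(self, value):
        self.total += value
        self.count += 1

    def average(self):
        return self.total / self.count if self.count else None

    def checkpoint(self):
        # In Spark this state would be written to the checkpoint directory
        return (self.total, self.count)

# Records consumed before the crash
agg = RunningAvg()
for v in [10, 20, 30]:
    agg.update(v)
saved_state = agg.checkpoint()     # persisted before the crash

# --- application crashes and restarts ---

# Without a checkpoint: only records after the last committed offset
# are consumed again, so the aggregate silently resets.
fresh = RunningAvg()
fresh.update(40)
print(fresh.average())             # 40.0 -- the earlier history was lost

# With the checkpoint: state is restored, then new records are applied.
restored = RunningAvg(saved_state)
restored.update(40)
print(restored.average())          # 25.0 -- (10 + 20 + 30 + 40) / 4
```

The Kafka offset commit only tells the restarted application where to resume reading; it says nothing about the aggregate built from the already-consumed records, which is exactly what the checkpoint preserves.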

Based on answer from: Is checkpointing necessary in spark streaming

shanmuga
  • What if I don't have a stateful operation, but I am reading Kafka offsets from earliest, and the application crashes? Will it know not to process the events it has already processed? – Funzo Mar 12 '19 at 14:08
  • In this case checking for already-processed events is not the responsibility of the Spark application; it is the responsibility of the Kafka cluster. As long as you use the same group_id after the application restarts, you will receive records from the last committed offset onwards. For this to work properly you should disable auto-commit and explicitly commit the offset immediately after you finish processing a micro-batch. – shanmuga Mar 13 '19 at 07:04
  • Also set `spark.streaming.stopGracefullyOnShutdown` while building the SparkSession. Even after all this there is a very slight chance that some records are processed twice, when the application crashes after processing a micro-batch but before committing the offset to Kafka. But this should not happen under normal execution and termination of the application. – shanmuga Mar 13 '19 at 07:09
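The duplicate-processing window mentioned in the last comment can be sketched in plain Python (no real Kafka; the topic list, consumer loop, and "crash" are stand-ins for illustration) by failing between processing a batch and committing its offset:

```python
# Plain-Python sketch of at-least-once delivery: if the application
# crashes after processing a batch but BEFORE committing its offset,
# the same batch is delivered and processed again on restart.

topic = ["a", "b", "c", "d"]     # stand-in for a Kafka partition
committed_offset = 0             # last offset committed to the "broker"
processed = []                   # stand-in for the downstream side effect

def poll(offset, batch_size=2):
    """Return the next batch starting from the given committed offset."""
    return topic[offset:offset + batch_size]

# First run: process one batch, then "crash" before committing.
batch = poll(committed_offset)
processed.extend(batch)          # side effect happens...
# ...simulated crash here: committed_offset is NOT advanced

# Restart: the broker still holds the old committed offset,
# so the same records are delivered (and processed) a second time.
batch = poll(committed_offset)
processed.extend(batch)
committed_offset += len(batch)   # the commit succeeds this time

print(processed)                 # ['a', 'b', 'a', 'b'] -- duplicates
```

This is why commit-after-processing gives at-least-once rather than exactly-once semantics: narrowing that window (graceful shutdown, committing immediately after each micro-batch) reduces, but cannot eliminate, the chance of reprocessing.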