I have built a Spark Structured Streaming application that reads data from Kafka topics, with startingOffsets set to latest. If there is a failure on the Spark side, from which point/offset will the data continue to be read after a restart? And is it a good idea to specify a checkpoint location in the write stream to make sure we resume from the point where the application failed?
2 Answers
I would advise you to set startingOffsets to earliest and to configure a checkpointLocation on durable storage (HDFS, MinIO, or similar). Note that setting kafka.group.id will not make Spark commit offsets back to Kafka (even in Spark 3+) unless you commit them manually, for example from foreachBatch.
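
For illustration, here is a minimal sketch of that setup. The broker address, topic name, and storage paths are placeholders, and the Parquet sink is just one possible output:

```scala
import org.apache.spark.sql.SparkSession

object KafkaCheckpointExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-checkpoint-example")
      .getOrCreate()

    // startingOffsets only applies to the very first run of the query;
    // on restart, offsets are recovered from the checkpoint instead.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                    // placeholder
      .option("startingOffsets", "earliest")
      .load()

    val query = df
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .writeStream
      .format("parquet")
      .option("path", "hdfs:///data/events")                   // placeholder
      .option("checkpointLocation", "hdfs:///checkpoints/app") // durable storage
      .start()

    query.awaitTermination()
  }
}
```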

Christos Natsis
You can use checkpoints, yes, or you can set kafka.group.id (in Spark 3+, at least). Otherwise, with startingOffsets set to latest and no checkpoint, a restarted query may start back at the end of the topic and skip any records that arrived while it was down.
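
For reference, a minimal sketch of that option, assuming a SparkSession named spark is already in scope; the broker address, topic, and group id are placeholders, and per the other answer Spark will not commit offsets to this group by itself:

```scala
// Requires Spark 3.0+: pins the Kafka consumer group id instead of
// letting Spark generate a random one per query.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "events")                    // placeholder
  .option("kafka.group.id", "my-streaming-app")     // placeholder group id
  .load()
```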

OneCricketeer