
I am working on a Spark Structured Streaming project and I am facing an issue with checkpointing.

In our HDFS we have a 25-day retention policy with day-wise partitions, and we delete files from HDFS on a daily basis. However, the checkpoint of my Spark streaming job keeps the names of all files seen since the job started, and if I clean up the checkpoint directory I have to run the job again for 25 days. So I need to drop checkpoint entries according to my retention policy, but the latest .compact file in the checkpoint still stores all the file names from the very beginning. Please help me resolve this issue.

User6006

1 Answer


You should not remove the checkpoint folder manually. There is a configuration option for this in the Spark configuration: https://spark.apache.org/docs/latest/configuration.html#memory-management

spark.cleaner.referenceTracking.cleanCheckpoints
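For example, a minimal sketch of setting this option when building a SparkSession (the app name is a placeholder and the surrounding code is only an illustration, not taken from the question):

import org.apache.spark.sql.SparkSession

// Ask Spark's context cleaner to also delete reference-tracked checkpoint
// files once the corresponding RDDs go out of scope.
val spark = SparkSession.builder()
  .appName("checkpoint-cleanup-example")  // placeholder app name
  .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
  .getOrCreate()

The same flag can also be passed on the command line with --conf spark.cleaner.referenceTracking.cleanCheckpoints=true at spark-submit time.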

For DStreams there is also a cleanup method:

https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStreamCheckpointData.scala

Adam Dukkon
  • On what basis will it take the reference and identify whether the file names are older than 25 days, and where can I set this property? – User6006 Dec 08 '19 at 17:28
  • If we set the above property to true, it controls whether to clean checkpoint files once the reference is out of scope; my question is how the system determines that the reference is out of scope. – User6006 Dec 08 '19 at 17:35
  • That's for Spark Core (for RDD checkpointing) and the old module Spark Streaming (not Structured Streaming). – Jacek Laskowski Dec 09 '19 at 06:16
  • The above property is not working in my case: the checkpoint is still holding the old file names even though the files themselves have already been deleted from HDFS by our retention policy. – User6006 Dec 09 '19 at 12:11
  • As far as I understand now, the main problem is not the huge size, but that the retention policy deletes the checkpoint folder after 25 days. I don't know whether your streaming application is stateful or stateless, but if you just want a recovery option, you can store the Kafka offsets externally in ZooKeeper or HBase (a minimal sketch of this idea follows below): https://blog.cloudera.com/offset-management-for-apache-kafka-with-apache-spark-streaming/. If it is a stateful application, you need the checkpoint for logical reasons, so you may need to modify your retention policy instead. – Adam Dukkon Dec 09 '19 at 19:56
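
Following up on the last comment: below is a minimal sketch of the DStream-based external offset management described in the linked Cloudera post. loadOffsetsFromExternalStore and saveOffsetsToExternalStore are hypothetical placeholders for your own ZooKeeper/HBase code, and the broker, topic and group id are assumptions; HasOffsetRanges, Subscribe and KafkaUtils.createDirectStream are the spark-streaming-kafka-0-10 APIs.

import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object ExternalOffsetSketch {
  // Hypothetical helpers standing in for your own ZooKeeper/HBase code.
  def loadOffsetsFromExternalStore(): Map[TopicPartition, Long] = Map.empty
  def saveOffsetsToExternalStore(ranges: Array[OffsetRange]): Unit = ()

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("offset-management-sketch"), Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",   // placeholder
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "my-group",      // placeholder
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Resume from whatever offsets were last persisted externally,
    // so a restart does not depend on the checkpoint directory.
    val startingOffsets = loadOffsetsFromExternalStore()

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("my-topic"), kafkaParams, startingOffsets)  // placeholder topic
    )

    stream.foreachRDD { rdd =>
      // Offset ranges covered by this batch.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // ... process the batch here ...

      // Persist the processed offsets to the external store.
      saveOffsetsToExternalStore(ranges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Whether this is worth doing depends on the job: as the comment above notes, a stateful application still needs its checkpoint for recovery, so external offset storage mainly helps the stateless, recovery-only case.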