I have a long-running Structured Streaming job which consumes several Kafka topics and aggregates over a sliding window. I need to understand how checkpoints are managed/cleaned up within HDFS.
The job runs fine and I am able to recover from a failed step with no data loss; however, I can see the HDFS utilisation increasing day by day. I cannot find any documentation on how Spark manages/cleans up the checkpoints. Previously the checkpoints were stored on S3, but this turned out to be quite costly given the large number of small files being read/written.
query = formatted_stream.writeStream \
.format("kafka") \
.outputMode(output_mode) \
.option("kafka.bootstrap.servers", bootstrap_servers) \
.option("checkpointLocation", "hdfs:///path_to_checkpoints") \
.start()
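For context, formatted_stream in the snippet above is built roughly along the following lines; the schema, topic names and window/watermark durations shown here are placeholders rather than the real job:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_json, struct, window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("windowed_kafka_aggregation").getOrCreate()

bootstrap_servers = "broker1:9092,broker2:9092"  # placeholder

# Hypothetical schema of the incoming events
schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("key", StringType()),
])

raw_stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", bootstrap_servers)
              .option("subscribe", "topic_a,topic_b")  # several topics
              .load())

# Parse the Kafka value, aggregate over a sliding window with a watermark,
# and serialise back into a key/value pair the Kafka sink can write
formatted_stream = (raw_stream
                    .select(from_json(col("value").cast("string"), schema).alias("event"))
                    .select("event.*")
                    .withWatermark("event_time", "10 minutes")
                    .groupBy(window(col("event_time"), "10 minutes", "5 minutes"), col("key"))
                    .count()
                    .select(col("key").alias("key"),
                            to_json(struct("window", "key", "count")).alias("value")))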
From what I understand, the checkpoints should be cleaned up automatically, yet after several days I just see my HDFS utilisation increasing linearly. How can I ensure the checkpoints are managed so that HDFS does not run out of space?
The accepted answer to Spark Structured Streaming Checkpoint Cleanup states that Structured Streaming should deal with this issue, but not how, or how it can be configured.
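For reference, the only retention-related setting I have come across is spark.sql.streaming.minBatchesToRetain; a minimal sketch of how I would set it, assuming it is what bounds the amount of checkpoint data kept around:

from pyspark.sql import SparkSession

# Sketch only: I am assuming spark.sql.streaming.minBatchesToRetain (the number of
# recent micro-batches Spark keeps recoverable, default 100) is what limits how much
# offset/commit/state metadata accumulates under the checkpoint directory.
spark = (SparkSession.builder
         .appName("windowed_kafka_aggregation")
         .config("spark.sql.streaming.minBatchesToRetain", "50")
         .getOrCreate())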