My code looks like the following:
def process(time, rdd):
    # ... do something with the "previous_batch" view and the incoming rdd,
    # producing a DataFrame `df` (details omitted) ...
    # Cache and checkpoint so the result survives beyond this batch.
    df = df.cache().checkpoint()
    # Register it under a fixed name so the next batch can read it.
    df.createOrReplaceTempView("previous_batch")
    # Drop the local reference; the temp view keeps the data reachable.
    del df

stream.foreachRDD(process)
I use this approach to access the DataFrame from the previous batch. It runs on a single-node standalone cluster, so the checkpoint directory is set to /tmp. I expected Spark to delete the checkpoint files automatically after a period of time, but none of them are ever deleted. I cannot figure out how to clean the checkpoint directory, and the disk will eventually run out of space if the job runs for a long time. Should I run a separate process to clean up the files myself?
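For context, the checkpoint directory is configured roughly like this. This is only a minimal sketch of the setup described above; the master URL, app name, and batch interval are placeholders, not my exact values:

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

# Placeholder session for a single-node standalone cluster.
spark = (SparkSession.builder
         .master("spark://localhost:7077")
         .appName("previous-batch-demo")
         .getOrCreate())
sc = spark.sparkContext

# df.checkpoint() writes its files under this directory; in my runs,
# none of these files were ever deleted automatically.
sc.setCheckpointDir("/tmp")

# 10-second batches (placeholder). "stream" in the snippet above is a
# DStream created from this context (source omitted).
ssc = StreamingContext(sc, 10)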