
My code looks like the following:

def process(time, rdd):
    # ... build df for this batch here, using rdd and the "previous_batch" view ...
    df = df.cache().checkpoint()                   # cache and checkpoint to cut the lineage
    df.createOrReplaceTempView("previous_batch")   # expose this batch to the next one
    del df                                         # drop the local reference

stream.foreachRDD(process)

I use this approach to access the DataFrame from the previous batch. This runs on a single-node standalone cluster, so the checkpoint directory is set to /tmp. I expected Spark to delete the checkpoint files automatically after a period of time, but none of them are ever deleted. I cannot figure out how to clean the checkpoint directory, and the disk will run out of space if the job runs for a long time. Should I run a separate process to clean up the files myself?
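
For reference, here is a minimal sketch of the two cleanup options I am considering. It assumes that the spark.cleaner.referenceTracking.cleanCheckpoints setting does what its name suggests (delete checkpoint files once the checkpointed data goes out of scope on the driver); the checkpoint path, helper name, and age threshold below are only illustrative, not part of my actual job:

import os
import shutil
import time

from pyspark.sql import SparkSession

# Assumption: this flag asks Spark's ContextCleaner to remove checkpoint
# files once the checkpointed RDD is garbage collected on the driver.
# It defaults to false, which may be why the files keep piling up.
spark = (SparkSession.builder
         .appName("checkpoint-cleanup-sketch")
         .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
         .getOrCreate())
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # illustrative path

def prune_old_checkpoints(checkpoint_dir="/tmp/spark-checkpoints", max_age_seconds=3600):
    # Fallback: manually remove checkpoint subdirectories older than max_age_seconds.
    # Both the directory and the threshold are illustrative values.
    now = time.time()
    for name in os.listdir(checkpoint_dir):
        path = os.path.join(checkpoint_dir, name)
        if os.path.isdir(path) and now - os.path.getmtime(path) > max_age_seconds:
            shutil.rmtree(path, ignore_errors=True)

If the config flag alone is enough, I would prefer it over running a separate cleanup process; otherwise I could call something like prune_old_checkpoints periodically from the driver or a cron job.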

Sonic
  • Maybe relevant: https://stackoverflow.com/q/43671757/1531971 –  Aug 29 '17 at 02:09
  • df.cache().checkpoint() will save the rdd to both memory and disk. Is that necessary? – Zhang Tong Aug 29 '17 at 02:23
  • @ZhangTong Actually, this dataframe is reused within this function, so I thought caching would make it faster. The checkpoint is for cutting the lineage of the rdd. Is it enough to use checkpoint only? – Sonic Aug 29 '17 at 07:23

0 Answers