
My code looks like the following:

def process(time, rdd):
    # ... build df for this batch here, using rdd and the "previous_batch" view ...
    df = df.cache().checkpoint()                   # cache and checkpoint to cut the lineage
    df.createOrReplaceTempView("previous_batch")   # expose this batch to the next one
    del df                                         # drop the local reference

stream.foreachRDD(process)

I use this approach to access the DataFrame from the previous batch. This runs on a single-node standalone cluster, so the checkpoint directory is set to /tmp. I expected Spark to delete the checkpoint files automatically after a period of time, but none of them are ever deleted. I cannot figure out how to clean the checkpoint directory, and the disk will run out of space if the job runs for a long time. Should I run a separate process to clean up the files myself?
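
For reference, here is a minimal sketch of the two cleanup options I am considering. It assumes that the spark.cleaner.referenceTracking.cleanCheckpoints setting does what its name suggests (delete checkpoint files once the checkpointed data goes out of scope on the driver); the checkpoint path, helper name, and age threshold below are only illustrative, not part of my actual job:

import os
import shutil
import time

from pyspark.sql import SparkSession

# Assumption: this flag asks Spark's ContextCleaner to remove checkpoint
# files once the checkpointed RDD is garbage collected on the driver.
# It defaults to false, which may be why the files keep piling up.
spark = (SparkSession.builder
         .appName("checkpoint-cleanup-sketch")
         .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
         .getOrCreate())
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # illustrative path

def prune_old_checkpoints(checkpoint_dir="/tmp/spark-checkpoints", max_age_seconds=3600):
    # Fallback: manually remove checkpoint subdirectories older than max_age_seconds.
    # Both the directory and the threshold are illustrative values.
    now = time.time()
    for name in os.listdir(checkpoint_dir):
        path = os.path.join(checkpoint_dir, name)
        if os.path.isdir(path) and now - os.path.getmtime(path) > max_age_seconds:
            shutil.rmtree(path, ignore_errors=True)

If the config flag alone is enough, I would prefer it over running a separate cleanup process; otherwise I could call something like prune_old_checkpoints periodically from the driver or a cron job.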

Sonic
  • Maybe relevant: https://stackoverflow.com/q/43671757/1531971 –  Aug 29 '17 at 02:09
  • df.cache().checkpoint() will save the rdd to both memory and disk. Is that necessary? – Zhang Tong Aug 29 '17 at 02:23
  • @ZhangTong Actually, this dataframe is reused within this function, so I thought caching would make it faster. The checkpoint is for cutting the lineage of the rdd. Is it enough to use checkpoint only? – Sonic Aug 29 '17 at 07:23

0 Answers