df = spark.readStream.option("readChangeFeed", "true").option("startingVersion", 2).load(tablePath)

def foreach_batch_function(df, epoch_id):
  print("epoch_id: ", epoch_id)
  df.write.mode("append").json("/mnt/sample/data/test/")

df.writeStream.foreachBatch(foreach_batch_function).start()

When I terminate the write stream and run it again, foreachBatch processes the same data again. How can I maintain checkpoints so the stream does not re-read the old data?


1 Answer


This Databricks knowledge base article answers my question: https://kb.databricks.com/streaming/checkpoint-no-cleanup-foreachbatch.html

You should manually specify the checkpoint directory with the checkpointLocation option, so that on restart the stream resumes from the last committed offsets instead of reprocessing old data.

streamingDF.writeStream
  .option("checkpointLocation", "<checkpoint-path>")
  .outputMode("append")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write.format("parquet").mode("overwrite").save(output_directory)
  }.start()
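
Applied to the Python snippet from the question, the same fix looks roughly like this. It is only a sketch: the checkpoint path below is a placeholder I made up, and you would replace it with any durable location (DBFS, cloud storage) of your own.

```python
# Read the Delta change data feed, as in the question.
df = (spark.readStream
      .option("readChangeFeed", "true")
      .option("startingVersion", 2)
      .load(tablePath))

def foreach_batch_function(batch_df, epoch_id):
    # batch_df holds only the current micro-batch's rows.
    print("epoch_id:", epoch_id)
    batch_df.write.mode("append").json("/mnt/sample/data/test/")

(df.writeStream
   .option("checkpointLocation", "/mnt/sample/checkpoints/test/")  # placeholder path; pick your own durable location
   .foreachBatch(foreach_batch_function)
   .start())
```

With the checkpoint in place, restarting the query with the same checkpointLocation picks up after the last successfully committed batch rather than starting again from startingVersion.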