I have run into several issues using checkpoints with Spark Structured Streaming on Databricks. The code below led to OOM errors on our clusters. Looking at the cluster's memory usage, we could see memory slowly increasing over time, which points to a memory leak (it took ~10 days to reach OOM, while a single batch only lasts a couple of minutes). After deleting the checkpoint so that a new one was created, the memory leak disappeared, which suggests the problem originated in the checkpoint. In a similar streaming job, we also had a problem where some data was never processed (again, fixed after re-creating the checkpoint).
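For completeness, this is roughly how we reset the checkpoint when the leak appeared; the checkpoint path is a placeholder and I am assuming dbutils is available on the cluster:

# Stop any running query, delete the old checkpoint, then restart the stream.
# "path/to/checkpoint" is a placeholder for our actual checkpoint location.
for q in spark.streams.active:
    q.stop()
dbutils.fs.rm("path/to/checkpoint", True)  # recursive delete; a fresh checkpoint is created on the next start()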
Disclaimer: I do not fully understand the in-depth behaviour of checkpoints, as the online documentation is vague, so I am not sure my configuration is correct.
Below is a minimal example of the problem (pyspark 3.0.1, Python 3.7).
The cluster's JSON configuration contains the following element:
"spark_conf": {
"spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite": "true",
"spark.databricks.delta.properties.defaults.autoOptimize.autoCompact": "true"
}
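As a sanity check (not part of the job itself), these values can be read back at runtime to confirm they actually reached the cluster:

# Confirm the Delta auto-optimize defaults are set on this cluster
print(spark.conf.get("spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite"))
print(spark.conf.get("spark.databricks.delta.properties.defaults.autoOptimize.autoCompact"))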
code:

import pandas as pd
from pyspark.sql import functions as F

def for_each_batch(data, epoch_id):
    # Placeholder: the real job processes the micro-batch here.
    pass

query = (
    spark.readStream.format("delta")
    .load("path/to/delta")
    # Keep only rows from the last hour (cutoff computed when the stream is defined)
    .filter(F.col("TIME") > pd.Timestamp.utcnow() - pd.Timedelta(hours=1))
    .writeStream
    .option("ignoreChanges", "true")
    .option("checkpointLocation", "path/to/checkpoint")
    .trigger(processingTime="3 minutes")
    .foreachBatch(for_each_batch)
    .start()
)
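To track whether state is accumulating between batches, I also log the query progress from time to time; this is only a monitoring sketch (variable names are mine), not part of the production job:

import time

# Periodically dump the last progress report; the 'stateOperators' section shows
# how much state the query keeps across micro-batches.
while query.isActive:
    progress = query.lastProgress  # None until the first batch has completed
    if progress is not None:
        print(progress["batchId"], progress.get("stateOperators"))
    time.sleep(180)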
PS: If the content of the function 'for_each_batch' or the filtering condition is changed, should I re-create the checkpoint?