
I have run into several issues using checkpoints with Spark Structured Streaming on Databricks. The code below led to OOM errors on our clusters. Looking at the cluster's memory usage, we could see memory slowly increasing over time, which points to a memory leak (it took ~10 days to reach OOM, while a batch only lasts a couple of minutes). After deleting the checkpoint so that a new one was created, the memory leak disappeared, which suggests the problem originated in the checkpoint. In a similar streaming job, we also had a problem where some data was never processed (again, fixed after re-creating the checkpoint).
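
For completeness, this is roughly how we reset the checkpoint on Databricks (the path is the same placeholder as in the code below, and the stream was stopped before deleting):

    # Delete the checkpoint directory so the next run starts from a fresh one.
    # "path/to/checkpoint" is the same placeholder used in the streaming code below.
    dbutils.fs.rm("path/to/checkpoint", True)  # True = recursive delete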

Disclaimer: I do not fully understand the internal behaviour of checkpoints, as the online documentation is rather vague, so I am not sure my configuration is correct.

Below is a minimal example of the problem:

PySpark 3.0.1, Python 3.7

The clusters' JSON configuration includes the following element:

  "spark_conf": {
    "spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite": "true",
    "spark.databricks.delta.properties.defaults.autoOptimize.autoCompact": "true"
  }

code:

import pandas as pd
from pyspark.sql import functions as F

def for_each_batch(data, epoch_id):
    # No-op sink, just for the minimal example.
    pass

(
    spark.readStream.format("delta")
    .load("path/to/delta")
    # The cutoff timestamp is computed once, when the query is defined.
    .filter(F.col("TIME") > pd.Timestamp.utcnow() - pd.Timedelta(hours=1))
    .writeStream
    .option("ignoreChanges", "true")
    .option("checkpointLocation", "path/to/checkpoint")
    .trigger(processingTime="3 minutes")
    .foreachBatch(for_each_batch)
    .start()
)

PS: if the body of the 'for_each_batch' function or the filtering condition is changed, should I re-create the checkpoint?

Noé Achache
  • Lower your batch size; I was getting this issue as well. If you want to keep the batch size the same, then you'll have to do something on the server side in the cluster to handle the OOMs. – Joe Oct 14 '21 at 19:59
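
Note on the comment above: if "lower your batch size" means limiting how much data each micro-batch reads, the Delta streaming source supports the maxFilesPerTrigger and maxBytesPerTrigger read options. A minimal sketch with purely illustrative values:

    # Cap the amount of data read per micro-batch from the Delta source.
    # The option values below are illustrative, not a recommendation.
    (
        spark.readStream.format("delta")
        .option("maxFilesPerTrigger", 100)   # at most ~100 files per micro-batch
        .option("maxBytesPerTrigger", "1g")  # soft upper bound on bytes per micro-batch
        .load("path/to/delta")
    )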
