I have an Azure Databricks Spark cluster consisting of 6 nodes (5 workers + 1 driver), each with 16 cores and 64GB of memory.
I'm running a PySpark notebook that:
- reads a DataFrame from parquet files,
- caches it (df.cache()),
- executes an action on it (df.toPandas()) -- a minimal sketch of these steps follows below.
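Roughly, those steps look like this (a sketch; the parquet path is a placeholder, not the one I actually use):

from pyspark.sql import SparkSession

# "spark" already exists in a Databricks notebook; getOrCreate() just reuses it
spark = SparkSession.builder.getOrCreate()

# read the DataFrame from parquet (placeholder path)
df = spark.read.parquet("/mnt/data/my_table")

# cache it in the executors' storage memory
df.cache()

# materialize the cache and pull the full result to the driver
pdf = df.toPandas()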
From the Spark UI's Storage tab I see the cached DataFrame takes up 9.6GB in memory, divided into 28 files and occupying 3GB+ of on-heap memory on each of 3 workers:
At this point, I can see from the mem_report on Ganglia that the 3 workers' on-heap memory is in use (i.e. the 40g -- see the Spark configs below).
Next, I clear the DF from cache (df.unpersist(True)), and after doing that I correctly see the storage object gone and the workers' storage memory (almost) emptied:
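For completeness, the unpersist step plus a quick sanity check (the check lines are just for illustration):

# drop the cached blocks, blocking until every executor has removed them
df.unpersist(True)

# confirm the DataFrame is no longer marked as cached
print(df.is_cached)   # False

# alternatively, drop every cached table/DataFrame at once
spark.catalog.clearCache()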
but my workers' executor memory is never released (not even after I detach my notebook from the cluster):
My question is: how can I get the workers to release their executor memory? Is it a GC problem (setting G1GC didn't help either -- see below)?
Thanks!
These are the relevant Spark config settings:
spark.executor.memory 40g
spark.memory.storageFraction .6
spark.databricks.io.cache.enabled true
spark.cleaner.periodicGC.interval 2m
spark.sql.execution.arrow.enabled true
spark.storage.cleanupFilesAfterExecutorExit true
spark.worker.cleanup.enabled true
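(These are set in the cluster's Spark config on Databricks. For reference only, a rough programmatic equivalent for a self-managed session would look like the sketch below; note that settings such as spark.executor.memory only take effect when the executors are launched, so this is an illustration rather than how my cluster is actually configured.)

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "40g")
    .config("spark.memory.storageFraction", "0.6")
    .config("spark.databricks.io.cache.enabled", "true")
    .config("spark.cleaner.periodicGC.interval", "2m")
    .config("spark.sql.execution.arrow.enabled", "true")
    .config("spark.storage.cleanupFilesAfterExecutorExit", "true")
    .config("spark.worker.cleanup.enabled", "true")
    .getOrCreate()
)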
Setting G1GC as follows did not have an impact on memory usage:
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=25 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
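As a sanity check that the options were actually picked up by the session (the GC output itself should appear in the executors' stdout/stderr logs):

# verify the session config contains the executor JVM options
print(spark.sparkContext.getConf().get("spark.executor.extraJavaOptions", ""))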
For the purposes of my experiment, nothing else is running on the cluster before or after the job execution.