6

I use the persist command to cache a DataFrame at the MEMORY_AND_DISK level and have been observing a weird pattern.

The persisted DataFrame is 100% cached once the job that performs the necessary transformations (Job 6 in the screenshot below) completes, but after Job 9 (a data quality check) the fraction cached drops to 55%, which forces a recomputation of the partially lost data (visible in Job 12). I have also seen from the metrics (Ganglia UI on Databricks) that at any given instant there was at least 50 GB of memory available.

(Below image is partially masked to avoid exposure of sensitive data)

Why would Spark evict a ~50 MB object persisted at MEMORY_AND_DISK when there is enough memory for the other transformations/actions? Is there a way to avoid this, other than the workaround of explicitly writing it to temporary storage?
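For reference, a minimal sketch of the caching pattern described above (the input path, column names and transformations are placeholders, not the real code):

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("persist-example").getOrCreate()

# Hypothetical source and transformation.
df = spark.read.parquet("/tmp/source_data")
transformed = df.filter("amount > 0").groupBy("id").count()

# Persist at MEMORY_AND_DISK and materialize the cache with an action
# (analogous to Job 6 above).
transformed.persist(StorageLevel.MEMORY_AND_DISK)
transformed.count()

# Later jobs (e.g. the data quality check) reuse `transformed`; if cached
# blocks are dropped, subsequent actions recompute the missing partitions.
transformed.describe().show()
```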

Sushant Pachipulusu
  • Maybe some of your executors got dropped, did you check that? What's the timeout you've set on your executors? – Fragan Dec 16 '22 at 12:16
  • @Fragan No, there were no executor drops or failures, nor did the cluster resize at all around that timeframe. This has been the behaviour for a long while and has been observed by my colleagues too – Sushant Pachipulusu Dec 17 '22 at 14:51
  • 1
    Can you share some code? Ideally not the full code, but a minimal version that reproduces the problem. – Oli Dec 18 '22 at 19:17
  • As far as I know, Spark by itself wouldn't unpersist part of a dataframe. Worst case, under memory pressure, it evicts cached blocks to disk (following LRU strategy). So I'd suspect some sort of infrastructure issue, and talk to DB Support maybe? Still, good question! – mazaneicha Dec 19 '22 at 15:14

1 Answer

-1

Spark's cache also has a configurable size limit. In old versions this was controlled by spark.storage.memoryFraction; since Spark 1.6 the unified memory manager uses spark.memory.fraction (default 0.6, i.e. 60% of the heap minus a reserved ~300 MB) for a region shared between execution and storage, with spark.memory.storageFraction (default 0.5) defining the portion of that region that is protected from eviction by execution. If the cached dataframes plus execution memory exceed these limits, Spark will start evicting cached blocks to stay within them.
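A sketch of raising these fractions when building the session (the values here are illustrative, not recommendations; they must be set before the executors start):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cache-tuning")
    # Fraction of (heap - 300 MB) shared by execution and storage (default 0.6).
    .config("spark.memory.fraction", "0.7")
    # Portion of that region protected from eviction by execution (default 0.5).
    .config("spark.memory.storageFraction", "0.6")
    .getOrCreate()
)
```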

To avoid these issues, you can try increasing the storage memory available for caching by raising spark.memory.fraction and/or spark.memory.storageFraction. You can also register the DataFrame as a temporary view and cache it explicitly with spark.catalog.cacheTable(), which gives you explicit control over when the data is cached and uncached.
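A minimal sketch of the explicit caching approach, assuming a DataFrame named df and an illustrative view name:

```python
# Register the DataFrame as a temp view and cache it through the catalog API.
df.createOrReplaceTempView("transformed_view")
spark.catalog.cacheTable("transformed_view")

# Run an action to materialize the cache before the downstream jobs use it.
spark.table("transformed_view").count()

# Release the cache explicitly when it is no longer needed.
spark.catalog.uncacheTable("transformed_view")
```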