6

I use the persist command to cache a DataFrame at the MEMORY_AND_DISK level and have been observing a weird pattern.

The persisted DataFrame is 100% cached once the job that performs the necessary transformations (Job 6 in the screenshot below) completes, but after Job 9 (a data quality check) the fraction cached drops to 55%, which forces a recomputation of the partially lost data (visible in Job 12). I have also seen from the metrics (Ganglia UI on Databricks) that at any given instant there was at least 50 GB of memory available.

(Below image is partially masked to avoid exposure of sensitive data)

Why would Spark evict a ~50 MB object persisted at MEMORY_AND_DISK when there is enough memory for the other transformations/actions? Is there a way to avoid this, other than the workaround of explicitly writing it to temporary storage?
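For reference, a minimal sketch of the caching pattern described above (the input path, column names and transformations are placeholders, not the real code):

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("persist-example").getOrCreate()

# Hypothetical source and transformation.
df = spark.read.parquet("/tmp/source_data")
transformed = df.filter("amount > 0").groupBy("id").count()

# Persist at MEMORY_AND_DISK and materialize the cache with an action
# (analogous to Job 6 above).
transformed.persist(StorageLevel.MEMORY_AND_DISK)
transformed.count()

# Later jobs (e.g. the data quality check) reuse `transformed`; if cached
# blocks are dropped, subsequent actions recompute the missing partitions.
transformed.describe().show()
```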

Sushant Pachipulusu
  • Maybe some of your executors got dropped, did you check that? What's the timeout you've set on your executors? – Fragan Dec 16 '22 at 12:16
  • @Fragan No, there were no executor drops or failures, nor did the cluster resize at all around that timeframe. This has been the behaviour for a long while and has been observed by my colleagues too – Sushant Pachipulusu Dec 17 '22 at 14:51
  • 1
    Can you share some code? Ideally not the full code, but a minimal version that reproduces the problem. – Oli Dec 18 '22 at 19:17
  • As far as I know, Spark by itself wouldn't unpersist part of a dataframe. Worst case, under memory pressure, it evicts cached blocks to disk (following LRU strategy). So I'd suspect some sort of infrastructure issue, and talk to DB Support maybe? Still, good question! – mazaneicha Dec 19 '22 at 15:14

1 Answer

-1

Spark's cache also has a configurable size limit. In old versions this was controlled by spark.storage.memoryFraction; since Spark 1.6 the unified memory manager uses spark.memory.fraction (default 0.6, i.e. 60% of the heap minus a reserved ~300 MB) for a region shared between execution and storage, with spark.memory.storageFraction (default 0.5) defining the portion of that region that is protected from eviction by execution. If the cached dataframes plus execution memory exceed these limits, Spark will start evicting cached blocks to stay within them.
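A sketch of raising these fractions when building the session (the values here are illustrative, not recommendations; they must be set before the executors start):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cache-tuning")
    # Fraction of (heap - 300 MB) shared by execution and storage (default 0.6).
    .config("spark.memory.fraction", "0.7")
    # Portion of that region protected from eviction by execution (default 0.5).
    .config("spark.memory.storageFraction", "0.6")
    .getOrCreate()
)
```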

To avoid these issues, you can try increasing the storage memory available for caching by raising spark.memory.fraction and/or spark.memory.storageFraction. You can also register the DataFrame as a temporary view and cache it explicitly with spark.catalog.cacheTable(), which gives you explicit control over when the data is cached and uncached.
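A minimal sketch of the explicit caching approach, assuming a DataFrame named df and an illustrative view name:

```python
# Register the DataFrame as a temp view and cache it through the catalog API.
df.createOrReplaceTempView("transformed_view")
spark.catalog.cacheTable("transformed_view")

# Run an action to materialize the cache before the downstream jobs use it.
spark.table("transformed_view").count()

# Release the cache explicitly when it is no longer needed.
spark.catalog.uncacheTable("transformed_view")
```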