
I have the following code, where I repartition the filtered input data and persist it:

val df = sparkSession.read
      .parquet(path)
      .as[struct1]
      .filter(dateRange(_, lowerBound, upperBound))
      .repartition(nrInputPartitions)
      .persist()

df.count // force materialization of the cache

I expect all the data to be stored in memory, but instead I see the following in the Spark UI:

Storage

Size in Memory   424.2 GB 
Size on Disk     44.1 GB

Is it because some partitions didn't fit in memory, and Spark automatically switched to the MEMORY_AND_DISK storage level?

philantrovert

Possible duplicate of [spark cache only keeps a fraction of RDD](https://stackoverflow.com/questions/29502234/spark-cache-only-keeps-a-fraction-of-rdd) – vindev Mar 15 '18 at 08:41

1 Answer


> Is it because some partitions didn't fit in memory, and Spark automatically switched to the MEMORY_AND_DISK storage level?

Almost. It is because you are caching a Dataset, not an RDD, and the default storage level for Datasets is MEMORY_AND_DISK (for RDDs it is MEMORY_ONLY). Otherwise your suspicion is correct: if there is not enough memory, or cache eviction is required, data goes to disk (although, technically speaking, it is not a spill).
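
For comparison, here is a minimal sketch of requesting an explicit MEMORY_ONLY level instead of the Dataset default. It reuses `path`, `struct1`, `dateRange` and `nrInputPartitions` from the question, and assumes `sparkSession` is a stable `val` so its implicits can be imported:

import org.apache.spark.storage.StorageLevel
import sparkSession.implicits._ // encoder for the struct1 case class

// Same pipeline as in the question, but with an explicit storage level.
// With MEMORY_ONLY, partitions that do not fit in memory are dropped and
// recomputed from the lineage on access, instead of being written to disk.
val dfMemOnly = sparkSession.read
      .parquet(path)
      .as[struct1]
      .filter(dateRange(_, lowerBound, upperBound))
      .repartition(nrInputPartitions)
      .persist(StorageLevel.MEMORY_ONLY)

dfMemOnly.count        // materialize the cache
dfMemOnly.storageLevel // should report memory only, no disk component

With the plain `persist()` call from the question, `df.storageLevel` would instead report MEMORY_AND_DISK, which matches the Storage tab output above.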

Alper t. Turker