
I have the following code, where I repartition the filtered input data and persist it:

val df = sparkSession.read
      .parquet(path)
      .as[struct1]
      .filter(dateRange(_, lowerBound, upperBound))
      .repartition(nrInputPartitions)
      .persist()

df.count // force materialization of the cache

I expect all the data to be stored in memory, but instead I see the following in the Spark UI:

Storage

Size in Memory   424.2 GB 
Size on Disk     44.1 GB

Is it because some partitions didn't fit in memory, and Spark automatically switched to the MEMORY_AND_DISK storage level?

philantrovert

Possible duplicate of [spark cache only keeps a fraction of RDD](https://stackoverflow.com/questions/29502234/spark-cache-only-keeps-a-fraction-of-rdd) – vindev Mar 15 '18 at 08:41

1 Answer


> Is it because some partitions didn't fit in memory, and Spark automatically switched to the MEMORY_AND_DISK storage level?

Almost. It is because you are caching a Dataset, not an RDD, and the default storage level for Datasets is MEMORY_AND_DISK (for RDDs it is MEMORY_ONLY). Otherwise your suspicion is correct: if there is not enough memory, or cache eviction is required, data goes to disk (although, technically speaking, it is not a spill).
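
For comparison, here is a minimal sketch of requesting an explicit MEMORY_ONLY level instead of the Dataset default. It reuses `path`, `struct1`, `dateRange` and `nrInputPartitions` from the question, and assumes `sparkSession` is a stable `val` so its implicits can be imported:

import org.apache.spark.storage.StorageLevel
import sparkSession.implicits._ // encoder for the struct1 case class

// Same pipeline as in the question, but with an explicit storage level.
// With MEMORY_ONLY, partitions that do not fit in memory are dropped and
// recomputed from the lineage on access, instead of being written to disk.
val dfMemOnly = sparkSession.read
      .parquet(path)
      .as[struct1]
      .filter(dateRange(_, lowerBound, upperBound))
      .repartition(nrInputPartitions)
      .persist(StorageLevel.MEMORY_ONLY)

dfMemOnly.count        // materialize the cache
dfMemOnly.storageLevel // should report memory only, no disk component

With the plain `persist()` call from the question, `df.storageLevel` would instead report MEMORY_AND_DISK, which matches the Storage tab output above.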

Alper t. Turker