How do we specify the local (Unix) file system directory where Spark spills RDDs when they won't fit in memory? We cannot find this in the documentation. Our analysis confirms the spill files are written to the local Unix file system, not to HDFS.
We are running on Amazon Elastic MapReduce, and Spark is spilling to `/mnt`. On our system, `/mnt` is an EBS volume while `/mnt1` is an SSD. We want to spill to `/mnt1` first; if that fills up, we want to spill to `/mnt2`; and we want `/mnt` to be the spill location of last resort. It is unclear how to configure this ordering, or how to monitor spilling.
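For reference, the only related setting we have found is `spark.local.dir`, which accepts a comma-separated list of scratch directories. A sketch of what we understand the configuration to be (the `.../spark` subdirectory names below are our own layout, not EMR defaults); note that per the Spark docs this property is overridden by the `SPARK_LOCAL_DIRS` environment variable in standalone/Mesos mode and by `yarn.nodemanager.local-dirs` on YARN, which is what EMR uses:

```
# spark-defaults.conf -- directory paths are our own choice, not EMR defaults.
# As far as we can tell, Spark spreads scratch files across ALL listed
# directories rather than filling them in order, so this does not give the
# /mnt1 -> /mnt2 -> /mnt failover we actually want.
spark.local.dir    /mnt1/spark,/mnt2/spark,/mnt/spark
```

Whether there is a way to get priority ordering rather than spreading, and how to apply this through YARN on EMR, is part of what we are asking.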
We have reviewed the existing SO questions:
- *Understanding Spark shuffle spill* appears to be out of date.
- *Why SPARK cached RDD spill to disk?* and *Use SSD for SPARK RDD* discuss spill behavior, but not where the files are written.
- *Spark shuffle spill metrics* is an unanswered question showing the spill UI, but it does not provide the details we are requesting.