How do we specify the local (Unix) file system directory where Spark spills RDDs when they won't fit in memory? We cannot find this in the documentation. Our analysis confirms the spill files are written to the local Unix file system, not to HDFS.
We are running on Amazon Elastic MapReduce, and Spark is spilling to `/mnt`. On our system, `/mnt` is an EBS volume while `/mnt1` is an SSD. We want to spill to `/mnt1` first; if that fills up, we want to spill to `/mnt2`; and we want `/mnt` to be the spill location of last resort. It is unclear how to configure this ordering, or how to monitor spilling.
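For reference, the only related setting we have found is `spark.local.dir`, which accepts a comma-separated list of scratch directories. A sketch of what we understand the configuration to be (the `.../spark` subdirectory names below are our own layout, not EMR defaults); note that per the Spark docs this property is overridden by the `SPARK_LOCAL_DIRS` environment variable in standalone/Mesos mode and by `yarn.nodemanager.local-dirs` on YARN, which is what EMR uses:

```
# spark-defaults.conf -- directory paths are our own choice, not EMR defaults.
# As far as we can tell, Spark spreads scratch files across ALL listed
# directories rather than filling them in order, so this does not give the
# /mnt1 -> /mnt2 -> /mnt failover we actually want.
spark.local.dir    /mnt1/spark,/mnt2/spark,/mnt/spark
```

Whether there is a way to get priority ordering rather than spreading, and how to apply this through YARN on EMR, is part of what we are asking.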
We have reviewed the existing SO questions:
- *Understanding Spark shuffle spill* appears to be out of date.
- *Why SPARK cached RDD spill to disk?* and *Use SSD for SPARK RDD* discuss spill behavior, but not where the files are written.
- *Spark shuffle spill metrics* is an unanswered question showing the spill UI, but it does not provide the details we are requesting.