
How do we specify the local (Unix) file system location where Spark spills RDDs when they won't fit in memory? We cannot find this in the documentation. Our analysis confirms that the data is being written to the local Unix file system, not to HDFS.

We are running on Amazon Elastic MapReduce (EMR). Spark is spilling to /mnt. On our system, /mnt is an EBS volume while /mnt1 is an SSD. We want to spill to /mnt1 first; if that fills up, we want to spill to /mnt2, with /mnt as the spillage of last resort. It's unclear how to configure this, or how to monitor spilling.

We have reviewed the existing SO questions on this topic.

— vy32

1 Answer

Check out https://spark.apache.org/docs/2.2.1/configuration.html#application-properties and search for

spark.local.dir

This defaults to /tmp; try setting it to the location of your EBS volume.
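Per that docs page, spark.local.dir can also be a comma-separated list of multiple directories on different disks, which is the closest thing to the multi-volume setup you're asking about. A minimal sketch of spark-defaults.conf, using the mount points from your question as the values:

    # conf/spark-defaults.conf
    # Scratch space for map output files and RDDs that spill to disk.
    # Directories here are the question's mount points, not defaults.
    spark.local.dir  /mnt1,/mnt2

Or equivalently per job on the command line:

    spark-submit --conf "spark.local.dir=/mnt1,/mnt2" ...

One caveat: as far as we can tell, Spark spreads its temporary files across all listed directories (it hashes file names to pick one) rather than filling them in order, so a strict fallback chain (SSD first, EBS only as a last resort) can't be expressed with this property alone.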

NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.
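Since EMR runs Spark on YARN, that note matters here: executors will use whatever LOCAL_DIRS YARN hands them, which comes from yarn.nodemanager.local-dirs on each node. A sketch of the corresponding yarn-site.xml entry; the /mnt1/yarn and /mnt2/yarn subdirectories are illustrative, not EMR's defaults:

    <!-- yarn-site.xml on each core/task node.
         YARN exports these directories to containers as LOCAL_DIRS,
         which Spark then uses instead of spark.local.dir.
         Paths are examples based on the question's mount points. -->
    <property>
      <name>yarn.nodemanager.local-dirs</name>
      <value>/mnt1/yarn,/mnt2/yarn</value>
    </property>

As for monitoring: whenever spilling occurs, the stage detail pages in the Spark UI show "Shuffle Spill (Memory)" and "Shuffle Spill (Disk)" per task, which is probably the easiest way to watch it.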

Also see the following Stack Overflow post for more detail:

— vi_ral