
I am trying to change the location Spark writes temporary files to. Everything I've found online says to set the SPARK_LOCAL_DIRS parameter in the spark-env.sh file, but I am not having any luck with the changes actually taking effect.

Here is what I've done:

  1. Created a 2-worker test cluster using Amazon EC2 instances. I'm using Spark 2.2.0 with the R sparklyr package as a front end. The worker nodes are spun up using an auto scaling group.
  2. Created a directory at /tmp/jaytest to store temporary files in. There is one on each worker and one on the master.
  3. Connected via PuTTY to the Spark master machine and the two workers, opened /home/ubuntu/spark-2.2.0-bin-hadoop2.7/conf/spark-env.sh, and modified the file to contain this line (sketched in full below): SPARK_LOCAL_DIRS="/tmp/jaytest"
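
For concreteness, here is roughly what each spark-env.sh contained after step 3:

    # /home/ubuntu/spark-2.2.0-bin-hadoop2.7/conf/spark-env.sh
    # same line on the master and on both workers
    SPARK_LOCAL_DIRS="/tmp/jaytest"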

Permissions for each of the spark-env.sh files are -rwxr-xr-x, and for the jaytest folders are drwxrwxr-x.

As far as I can tell this is in line with all the advice I've read online. However, when I load some data into the cluster it still ends up in /tmp, rather than /tmp/jaytest.

I have also tried setting the spark.local.dir parameter to the same directory, but no luck there either.
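
For reference, this is roughly how I set spark.local.dir from the sparklyr side (a sketch; the master URL is a placeholder for my cluster's actual address):

    library(sparklyr)

    # point Spark's scratch space at the same directory as in spark-env.sh
    config <- spark_config()
    config$spark.local.dir <- "/tmp/jaytest"

    # "spark://<master-ip>:7077" is a placeholder master URL
    sc <- spark_connect(master = "spark://<master-ip>:7077", config = config)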

Can someone please advise on what I might be missing here?

Edit: I'm running this as a standalone cluster (as the answer below indicates that the correct parameter to set depends on the cluster type).


2 Answers


As per the Spark documentation, if you have configured YARN as the cluster manager, it will override the spark-env.sh setting. Can you check the yarn-env or yarn-site file for the local-dir setting?
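
For example, the property that typically feeds the LOCAL_DIRS environment variable is yarn.nodemanager.local-dirs in yarn-site.xml, along these lines (a sketch; the path is just an example, and this only applies when YARN is the cluster manager):

    <!-- yarn-site.xml -->
    <property>
      <name>yarn.nodemanager.local-dirs</name>
      <value>/tmp/jaytest</value>
    </property>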

"this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager." source - https://spark.apache.org/docs/2.3.1/configuration.html

  • Thanks Vijay. I'm running a standalone cluster, so as per the documentation I'm trying to set the `SPARK_LOCAL_DIRS` parameter rather than the `spark.local.dir` parameter (since the former will override the latter). – jay Aug 29 '18 at 21:12

Mac environment, Spark 2.1.0, and spark-env.sh contains:

    export SPARK_LOCAL_DIRS=/Users/kylin/Desktop/spark-tmp

Using spark-shell, it works.
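
One way to check that it is taking effect (a sketch; the exact subdirectory names will vary): while a spark-shell session is running, list the configured directory and look for Spark's scratch subdirectories.

    # run this while a spark-shell session is active
    ls /Users/kylin/Desktop/spark-tmp
    # expect scratch subdirectories such as spark-<uuid>/ and blockmgr-<uuid>/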

Did you use the right format?

  • Thanks kylin, tried adding `export SPARK_LOCAL_DIRS=/path/to/dir` to the conf file but didn't work using sparklyr (I'm not using the shell - need a solution that works via sparklyr). What do you mean by 'using the right format'? – jay Sep 07 '18 at 00:27