
I am running Spark jobs on an AWS EMR cluster, submitting them from a client host machine. The client machine is just an EC2 instance that submits jobs to EMR via YARN in cluster mode.

The problem is that Spark saves temp files, each about 200 MB, like:

/tmp/spark-456184c9-d59f-48f4-9b0560b7d310655/__spark_conf__6943938018805427428.zip

The tmp folder fills up with such files very quickly, and jobs start failing with the error:

No space left on device

I tried configuring spark.local.dir in spark-defaults.conf to point to my S3 bucket, but it prepends the user directory to the path, like this: /home/username/s3a://my-bucket/spark-tmp-folder
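For reference, the attempted setting in spark-defaults.conf looked roughly like this (bucket name is illustrative); note that spark.local.dir expects a local filesystem path, which is why the URL ends up treated as a relative path under the user's home directory:

```
# spark-defaults.conf -- attempted configuration (does not work:
# spark.local.dir must be a local scratch directory, not an S3 URL)
spark.local.dir    s3a://my-bucket/spark-tmp-folder
```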

Could you please suggest how I can fix this problem?

1 Answer

  • I created a zip archive of the Spark libraries (spark_libs.zip)
    and uploaded it to the S3 bucket.
  • Then, on the client host machine that submits the jobs, I pointed
    the property spark.yarn.archive in spark-defaults.conf at it:
    spark.yarn.archive s3a://mybucket/libs/spark_libs.zip
  • Now Spark uploads only the configs to the local tmp folder, which
    takes about 170 KB instead of 200 MB per job.
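The steps above can be sketched as follows; SPARK_HOME, the archive location, and the bucket layout are assumptions for illustration, not values confirmed by the original post:

```shell
# Package the Spark jars installed on the client machine into one archive.
# $SPARK_HOME is assumed to point at the local Spark installation.
cd "$SPARK_HOME/jars"
zip -q /tmp/spark_libs.zip ./*.jar

# Upload the archive to S3 (the AWS CLI uses the s3:// scheme;
# Spark then reads it through the s3a:// connector).
aws s3 cp /tmp/spark_libs.zip s3://mybucket/libs/spark_libs.zip
```

Then add the property to spark-defaults.conf on the client machine:

spark.yarn.archive    s3a://mybucket/libs/spark_libs.zip

With spark.yarn.archive set, YARN pulls the Spark libraries from S3 instead of each submission re-uploading them through the local tmp directory.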