
I am running Spark jobs on an AWS EMR cluster, submitting them from a client host machine. The client machine is just an EC2 instance that submits jobs to EMR via YARN in cluster mode.

The problem is that Spark saves temp files, each about 200 MB, like:

/tmp/spark-456184c9-d59f-48f4-9b0560b7d310655/__spark_conf__6943938018805427428.zip

The tmp folder fills up with such files very quickly, and jobs start failing with the error:

No space left on device

I tried configuring spark.local.dir in spark-defaults.conf to point to my S3 bucket, but it prepends the user directory to the path, like this: /home/username/s3a://my-bucket/spark-tmp-folder
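For reference, the attempted setting in spark-defaults.conf looked roughly like this (bucket name is illustrative); note that spark.local.dir expects a local filesystem path, which is why the URL ends up treated as a relative path under the user's home directory:

```
# spark-defaults.conf -- attempted configuration (does not work:
# spark.local.dir must be a local scratch directory, not an S3 URL)
spark.local.dir    s3a://my-bucket/spark-tmp-folder
```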

Could you please suggest how I can fix this problem?

1 Answer

  • I created a zip archive of the Spark libraries (spark_libs.zip)
    and uploaded it to the S3 bucket.
  • Then, on the client host machine that submits the jobs, I pointed
    the property spark.yarn.archive in spark-defaults.conf at it:
    spark.yarn.archive s3a://mybucket/libs/spark_libs.zip
  • Now Spark uploads only the configs to the local tmp folder, which
    takes about 170 KB instead of 200 MB per job.
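The steps above can be sketched as follows; SPARK_HOME, the archive location, and the bucket layout are assumptions for illustration, not values confirmed by the original post:

```shell
# Package the Spark jars installed on the client machine into one archive.
# $SPARK_HOME is assumed to point at the local Spark installation.
cd "$SPARK_HOME/jars"
zip -q /tmp/spark_libs.zip ./*.jar

# Upload the archive to S3 (the AWS CLI uses the s3:// scheme;
# Spark then reads it through the s3a:// connector).
aws s3 cp /tmp/spark_libs.zip s3://mybucket/libs/spark_libs.zip
```

Then add the property to spark-defaults.conf on the client machine:

spark.yarn.archive    s3a://mybucket/libs/spark_libs.zip

With spark.yarn.archive set, YARN pulls the Spark libraries from S3 instead of each submission re-uploading them through the local tmp directory.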