3

I'm trying to copy data from a local hadoop cluster to an S3 bucket using distcp.

Sometimes it "works", but some of the mappers fail with the stack trace below. Other times, so many mappers fail that the whole job is cancelled.

The error "No space available in any of the local directories." doesn't make sense to me. There is PLENTY of space on the edge node (where the distcp command is running), on the cluster, and in the S3 bucket.

Can anyone shed some light on this?

16/06/16 15:48:08 INFO mapreduce.Job: The url to track the job: <url>
16/06/16 15:48:08 INFO tools.DistCp: DistCp job-id: job_1465943812607_0208
16/06/16 15:48:08 INFO mapreduce.Job: Running job: job_1465943812607_0208
16/06/16 15:48:16 INFO mapreduce.Job: Job job_1465943812607_0208 running in uber mode : false
16/06/16 15:48:16 INFO mapreduce.Job:  map 0% reduce 0%
16/06/16 15:48:23 INFO mapreduce.Job:  map 33% reduce 0%
16/06/16 15:48:26 INFO mapreduce.Job: Task Id : attempt_1465943812607_0208_m_000001_0, Status : FAILED
Error: java.io.IOException: File copy failed: hdfs://<hdfs path>/000000_0 --> s3n://<bucket>/<s3 path>/000000_0
        at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:285)
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:253)
        at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:50)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.io.IOException: Couldn't run retriable-command: Copying hdfs://<hdfs path>/000000_0 to s3n://<bucket>/<s3 path>/000000_0
        at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
        at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:281)
        ... 10 more
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: No space available in any of the local directories.
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:366)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:416)
        at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.newBackupFile(NativeS3FileSystem.java:263)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.<init>(NativeS3FileSystem.java:245)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.create(NativeS3FileSystem.java:412)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:986)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:174)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:123)
        at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:99)
        at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
        ... 11 more
Zack

4 Answers

2

We ran into a similar exception while trying to save the results of an Apache Spark (version 1.5.2) run directly to S3; the stack trace was the same. I'm not really sure what the core issue is - somehow the S3 upload doesn't seem to "play nice" with Hadoop's LocalDirAllocator class (version 2.7).

What finally solved it for us was a combination of:

  1. enabling the S3A "fast upload" by setting "fs.s3a.fast.upload" to "true" in the Hadoop configuration. This uses S3AFastOutputStream instead of S3AOutputStream and uploads data directly from memory, instead of first allocating local storage

  2. merging the results of the job into a single partition before saving to S3 (in Spark this is done with repartition/coalesce)
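
A rough PySpark sketch of both steps (the input DataFrame and the paths below are placeholders, and it uses the newer SparkSession API rather than the Spark 1.5 one we were on):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. enable the S3A "fast upload" path, which buffers in memory instead of on local disk
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.fast.upload", "true")

# 2. merge the job's output into a single partition before writing (note: s3a, not s3n)
df = spark.range(100)  # stand-in for whatever DataFrame the job actually produced
df.coalesce(1).write.parquet("s3a://<bucket>/<s3 path>/")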

Some caveats though:

  1. the S3A fast upload is apparently marked "experimental" in Hadoop 2.7

  2. this workaround only applies to the newer s3a filesystem ("s3a://..."); it won't work for the older "native" s3n filesystem ("s3n://...")

Hope this helps.

2

Ideally you should use s3a rather than s3n, as s3n is deprecated.

With s3a, there is a parameter:

<property>
  <name>fs.s3a.buffer.dir</name>
  <value>${hadoop.tmp.dir}/s3a</value>
  <description>Comma separated list of directories that will be used to buffer file
uploads to. No effect if fs.s3a.fast.upload is true.</description>
</property>

When you get the local file error, it is most likely because the buffer directory has no space.

While you can change this setting to point at a directory with more space, a better solution may be to set (again for s3a):

fs.s3a.fast.upload=true

This avoids buffering the data on local disk and should actually be faster too.

The S3n buffer directory parameter should be:

fs.s3.buffer.dir

So if you stick with s3n, ensure the directory that property points to has plenty of space, and that should hopefully resolve this issue.
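
Either option can be passed straight to distcp as -D options on the command line; a rough sketch, reusing the question's placeholder paths (the s3n buffer directory below is just an example location with enough free space):

# option 1: switch to s3a and upload from memory instead of a local buffer file
hadoop distcp -Dfs.s3a.fast.upload=true \
    hdfs://<hdfs path>/ s3a://<bucket>/<s3 path>/

# option 2: stay on s3n, but point its buffer at a directory with plenty of space
hadoop distcp -Dfs.s3.buffer.dir=/path/with/space \
    hdfs://<hdfs path>/ s3n://<bucket>/<s3 path>/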

Stephen ODonnell
  • fs.s3a.fast.upload = true does default to using local HDD on 2.8+, as the pure on-heap version of Hadoop 2.7 was always running out of memory if more data was being written than could be uploaded...this would invariably happen at the wrong time. But: it will only use disk for those blocks being actively written or uploaded, rather than buffer the entire file. – stevel Sep 14 '17 at 09:46
0

I had this error for a few days and could not work out what was happening; all nodes had PLENTY of space (around 400GB). After some research I found this in the logs:

2019-01-09 17:31:30,326 WARN [main] org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext: Failed to create /mnt/hadoop/tmp/s3a

The exception talks about space, but the real error is permissions; the message could be improved.
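
A quick way to check is to look at the ownership and permissions of the directory from the WARN line (the owner shown below is only an example; the right one depends on which user the YARN containers run as):

ls -ld /mnt/hadoop/tmp /mnt/hadoop/tmp/s3a

# make sure the user running the tasks can create files there, e.g.:
sudo chown -R yarn:hadoop /mnt/hadoop/tmp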

Bruno Manzo
0

I had the same problem using Hadoop 2.8.5, but setting "fs.s3a.fast.upload" to "true" alone did not solve it. I also had to set fs.s3a.fast.upload.buffer to "bytebuffer". The default setting of fs.s3a.fast.upload.buffer is "disk", which explains why I continued to get the same error. There is also an "array" setting, but I did not try that.

The available fs.s3a.fast.upload.buffer settings are:

  1. bytebuffer: buffered to JVM off-heap memory.

  2. array: buffered to JVM on-heap memory.

  3. disk [DEFAULT]: buffered to local hard disks.

There are caveats for each, which are explained in the Hadoop S3A documentation.

Example pySpark code below.

import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# set the S3A options on the underlying Hadoop configuration
hdpConf = sc._jsc.hadoopConfiguration()
user = os.getenv("USER")
hdpConf.set("hadoop.security.credential.provider.path", "jceks://hdfs/user/{}/awskeyfile.jceks".format(user))
hdpConf.set("fs.s3a.fast.upload", "true")
hdpConf.set("fs.s3a.fast.upload.buffer", "bytebuffer")
Clay