I am trying to copy a file from an S3 bucket in Frankfurt (eu-central-1) onto HDFS on my EMR cluster in Ireland (eu-west-1). The copy commands I tried were:
hdfs dfs -cp "s3a://<bucket>/<file>" /user/hadoop/<file>
and
s3-dist-cp --src "s3a://<bucket>/" --dest hdfs:///user/hadoop/ --srcPattern <file>
and
hadoop distcp "s3a://<bucket>/<file>" /user/hadoop/<file>
In all cases (and with various permutations of extra options and the s3, s3a, s3n schemes on all of those commands) I get something like the following exception:
16/01/15 11:48:24 ERROR tools.DistCp: Exception encountered
com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 4A77158C1BD71C29), S3 Extended Request ID: LU41MspxqVnHqyaMreTvggRG480Wb9d+TBx1MAo5v/g9yz07mmPizcZVOtRMQ+GElXs8vl/WZXA=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1219)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:803)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:505)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:317)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3595)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1041)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1013)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2644)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.tools.GlobbedCopyListing.doBuildListing(GlobbedCopyListing.java:76)
at org.apache.hadoop.tools.CopyListing.buildListing(CopyListing.java:84)
at org.apache.hadoop.tools.DistCp.createInputFileListing(DistCp.java:353)
at org.apache.hadoop.tools.DistCp.execute(DistCp.java:160)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:121)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:401)
So I figured that s3a uses Amazon's SDK behind the scenes. After a lot of research I found the presumed cause:
Link: Amazon s3a returns 400 Bad Request with Spark
Link: https://github.com/aws/aws-sdk-java/issues/360
So, in summary: Frankfurt and some other newer regions only support Signature Version 4, but hdfs dfs -cp, distcp and s3-dist-cp can only use Version 2?
Since s3a goes through the AWS SDK, I tried to enforce Signature V4 by adding
export JAVA_OPTS="-Dcom.amazonaws.services.s3.enableV4 -Dcom.amazonaws.services.s3.enforceV4"
But to no avail.
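I suspect the Hadoop launcher scripts do not read JAVA_OPTS at all but HADOOP_OPTS, so the next thing I would try is roughly the following. This is only a sketch: whether fs.s3a.endpoint is supported by the Hadoop version on my EMR release, and whether this actually makes the SDK sign with V4, are assumptions on my part.

# Sketch: pass the SDK's V4 switches through Hadoop's own JVM options
# and point s3a at the regional Frankfurt endpoint.
export HADOOP_OPTS="-Dcom.amazonaws.services.s3.enableV4 -Dcom.amazonaws.services.s3.enforceV4"

# distcp runs through ToolRunner (see the stack trace above), so Hadoop
# configuration can be passed on the command line with -D:
hadoop distcp -Dfs.s3a.endpoint=s3.eu-central-1.amazonaws.com "s3a://<bucket>/<file>" /user/hadoop/<file>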
The same failure made me try all of the above with a bucket NOT in eu-central-1 but in e.g. eu-west-1. That worked. So I guess this is the cause?
Is there a solution to this problem? Anyone experiencing this too?
EDIT
A working alternative is to use the AWS CLI to download the data from S3 onto the master node and then use, e.g.
hdfs dfs -put <src> <dst>
to get the job done. But what should I do if the data is really massive and does not fit on the master node?
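One variant I am considering for that case is to stream the object straight from S3 into HDFS without landing it on the master's local disk, roughly like this. This is only a sketch; it assumes both tools accept - for stdout/stdin (which their documentation describes, but I have not tested this end to end):

# Sketch: stream the object through stdout/stdin so nothing is written to
# local disk; the AWS CLI signs with V4 for eu-central-1 when the region is set.
aws s3 cp "s3://<bucket>/<file>" - --region eu-central-1 | hdfs dfs -put - /user/hadoop/<file>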