I am trying to copy a file from an S3 bucket in Frankfurt (eu-central-1) onto HDFS on my EMR cluster in Ireland (eu-west-1). The copy commands I tried were:
hdfs dfs -cp "s3a://<bucket>/<file>" /user/hadoop/<file>
and
s3-dist-cp --src "s3a://<bucket>/" --dest hdfs:///user/hadoop/ --srcPattern <file>
and
hadoop distcp "s3a://<bucket>/<file>" /user/hadoop/<file>
In all cases (and with various permutations of extra options and the s3, s3a, s3n schemes on all of those commands) I get something like the following exception:
16/01/15 11:48:24 ERROR tools.DistCp: Exception encountered
com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 4A77158C1BD71C29), S3 Extended Request ID: LU41MspxqVnHqyaMreTvggRG480Wb9d+TBx1MAo5v/g9yz07mmPizcZVOtRMQ+GElXs8vl/WZXA=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1219)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:803)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:505)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:317)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3595)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1041)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1013)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2644)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.tools.GlobbedCopyListing.doBuildListing(GlobbedCopyListing.java:76)
at org.apache.hadoop.tools.CopyListing.buildListing(CopyListing.java:84)
at org.apache.hadoop.tools.DistCp.createInputFileListing(DistCp.java:353)
at org.apache.hadoop.tools.DistCp.execute(DistCp.java:160)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:121)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:401)
So I figured that s3a uses Amazon's SDK behind the scenes. After a lot of research I found the presumed cause:
Link: Amazon s3a returns 400 Bad Request with Spark
Link: https://github.com/aws/aws-sdk-java/issues/360
So, in summary: Frankfurt and some other newer regions only support Signature Version 4, but hdfs dfs -cp, distcp and s3-dist-cp can only use Version 2?
Since s3a goes through the AWS SDK, I tried to enforce Signature V4 by adding
export JAVA_OPTS="-Dcom.amazonaws.services.s3.enableV4 -Dcom.amazonaws.services.s3.enforceV4"
But to no avail.
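I suspect the Hadoop launcher scripts do not read JAVA_OPTS at all but HADOOP_OPTS, so the next thing I would try is roughly the following. This is only a sketch: whether fs.s3a.endpoint is supported by the Hadoop version on my EMR release, and whether this actually makes the SDK sign with V4, are assumptions on my part.

# Sketch: pass the SDK's V4 switches through Hadoop's own JVM options
# and point s3a at the regional Frankfurt endpoint.
export HADOOP_OPTS="-Dcom.amazonaws.services.s3.enableV4 -Dcom.amazonaws.services.s3.enforceV4"

# distcp runs through ToolRunner (see the stack trace above), so Hadoop
# configuration can be passed on the command line with -D:
hadoop distcp -Dfs.s3a.endpoint=s3.eu-central-1.amazonaws.com "s3a://<bucket>/<file>" /user/hadoop/<file>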
The same failure made me try all of the above with a bucket NOT in eu-central-1 but in e.g. eu-west-1. That worked. So I guess this is the cause?
Is there a solution to this problem? Anyone experiencing this too?
EDIT
A working alternative is to use the AWS CLI to download the data from S3 onto the master node and then use, e.g.
hdfs dfs -put <src> <dst>
to get the job done. But what should I do if the data is really massive and does not fit on the master node?
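One variant I am considering for that case is to stream the object straight from S3 into HDFS without landing it on the master's local disk, roughly like this. This is only a sketch; it assumes both tools accept - for stdout/stdin (which their documentation describes, but I have not tested this end to end):

# Sketch: stream the object through stdout/stdin so nothing is written to
# local disk; the AWS CLI signs with V4 for eu-central-1 when the region is set.
aws s3 cp "s3://<bucket>/<file>" - --region eu-central-1 | hdfs dfs -put - /user/hadoop/<file>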