
I am trying to set up a trivial EMR job to perform word counting of massive text files stored in s3://__mybucket__/input/. I am unable to correctly add the first of the two required streaming steps (the first maps the input with wordSplitter.py and reduces with an IdentityReducer into temporary storage; the second maps the contents of that temporary storage with /bin/wc and again reduces with an IdentityReducer).
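For reference, this is roughly how I would expect to express the two steps with the AWS CLI (`aws emr add-steps`); the cluster ID and the intermediate output prefix are placeholders:

```sh
# Step 1: split words with wordSplitter.py, pass through the IdentityReducer to an intermediate prefix.
# Step 2: count the intermediate records with /bin/wc, pass through the IdentityReducer again.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  Type=STREAMING,Name=WordSplit,ActionOnFailure=CONTINUE,Args=[-files,s3://elasticmapreduce/samples/wordcount/wordSplitter.py,-mapper,wordSplitter.py,-reducer,org.apache.hadoop.mapred.lib.IdentityReducer,-input,s3://__mybucket__/input/,-output,s3://__mybucket__/intermediate/] \
  Type=STREAMING,Name=WordCount,ActionOnFailure=CONTINUE,Args=[-mapper,/bin/wc,-reducer,org.apache.hadoop.mapred.lib.IdentityReducer,-input,s3://__mybucket__/intermediate/,-output,s3://__mybucket__/output/]
```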

This is the (failure) description of the first step:

Status:FAILED
Reason:S3 Service Error.
Log File:s3://aws-logs-209733341386-us-east-1/elasticmapreduce/j-2XC5AT2ZP48FJ/steps/s-1SML7U7CXRDT5/stderr.gz
Details:Exception in thread "main" com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 7799087FCAE73457), S3 Extended Request ID: nQYTtW93TXvi1G8U4LLj73V1xyruzre+uSt4KN1zwuIQpwDwa+J8IujOeQMpV5vRHmbuKZLasgs=
JAR location: command-runner.jar
Main class: None
Arguments: hadoop-streaming -files s3://elasticmapreduce/samples/wordcount/wordSplitter.py -mapper wordSplitter.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input s3://__mybucket__/input/ -output s3://__mybucket__/output/
Action on failure: Continue

This is the command being sent to the Hadoop cluster:

JAR location: command-runner.jar
Main class: None
Arguments: hadoop-streaming -mapper s3a://elasticmapreduce/samples/wordcount/wordSplitter.py -reducer aggregate -input s3a://__my_bucket__/input/ -output s3a://__my_bucket__/output/
Skyler

1 Answer

I think the solution here is likely very easy.

Instead of s3://, use s3a:// as the scheme for your job's access to the bucket. See here: the s3:// scheme is deprecated and requires the bucket in question to be dedicated exclusively to your Hadoop data. Quote from the above doc link:

This filesystem requires you to dedicate a bucket for the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.
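Applied to the first step from your question, the step arguments would then look something like this (everything is taken from your posted step, only the URI scheme is changed):

```sh
hadoop-streaming \
  -files s3a://elasticmapreduce/samples/wordcount/wordSplitter.py \
  -mapper wordSplitter.py \
  -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
  -input s3a://__mybucket__/input/ \
  -output s3a://__mybucket__/output/
```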

Armin Braun
  • I got the same error. Bad Request `Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 4C4C5B2DBA095F5C` – Skyler Dec 27 '16 at 16:31
  • @Skyler hmm maybe to help me narrow this down. Are you able to copy said file to your HDFS via distcp? – Armin Braun Dec 27 '16 at 17:35
  • I have no idea what that is. I am not interfacing with Hadoop directly. I'm using the AWS console – Skyler Dec 29 '16 at 19:35
  • @Skyler oh, can you share the code for interacting with the AWS console (best if you add it to the question before the error imo)? – Armin Braun Dec 29 '16 at 19:46
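For reference, the distcp check suggested in the comments is run from the cluster's master node over SSH; a minimal sketch (the HDFS target path is only an example):

```sh
# From the EMR master node: try copying the input from S3 into HDFS.
# If this also fails with a 400 Bad Request, the problem is S3 access itself,
# not the streaming step.
hadoop distcp s3a://__mybucket__/input/ hdfs:///tmp/input-copy/
```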