
I would like to read a file from S3 in my EMR Hadoop job. I am using the Custom JAR option.

I have tried two solutions:

  • org.apache.hadoop.fs.S3FileSystem: throws a NullPointerException.
  • com.amazonaws.services.s3.AmazonS3Client: throws an exception, saying "Access denied".

What I don't understand is this: I am starting the job from the EMR Console, so I would expect to have the necessary permissions. However, the AWS_*_KEY keys are missing from the environment variables (System.getenv()) that are available to the mapper.

I am sure I am doing something wrong, I just don't know what.

David Nemeskey

3 Answers


Probably a little bit late, but: use InstanceProfileCredentialsProvider with AmazonS3Client.
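A minimal sketch of what that might look like, assuming the cluster's EC2 instances have an instance profile attached; the bucket and key names below are placeholders:

    import com.amazonaws.auth.InstanceProfileCredentialsProvider;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.S3Object;
    import com.amazonaws.services.s3.model.S3ObjectInputStream;

    public class S3RoleReadExample {
        public static void main(String[] args) throws Exception {
            // The provider fetches temporary credentials from the EC2 instance
            // metadata service, so no keys appear in code or environment variables.
            AmazonS3Client s3 = new AmazonS3Client(new InstanceProfileCredentialsProvider());
            // Placeholder bucket and key.
            S3Object object = s3.getObject("your-bucket", "path/to/file.txt");
            try (S3ObjectInputStream in = object.getObjectContent()) {
                // read the stream as needed
            }
        }
    }

This only works on instances that actually have an IAM role attached; on a machine without an instance profile the provider has nothing to fetch.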

Ivan Konyshev

I think your EMR cluster needs to have access to S3. You can create an IAM role for your EMR cluster and give it access to S3; check this link: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html
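Once the role is in place, the mapper can read a side file from S3 through the normal Hadoop FileSystem API without embedding any keys. A rough sketch, assuming the role grants read access to the bucket and using a made-up bucket and path:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper that loads a side file from S3 in setup().
    // On EMR the s3n:// scheme is served by the cluster's S3 filesystem,
    // which picks up the credentials of the attached instance-profile role.
    public class SideFileMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            Path sideFile = new Path("s3n://your-bucket/path/to/side-file.txt"); // placeholder
            FileSystem fs = sideFile.getFileSystem(conf);
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(sideFile)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // e.g. load the lines into an in-memory lookup table
                }
            }
        }
    }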

SelimN
  • This was the right way to go. Without roles, the only solution is to write the access keys directly into the code (or a file in the jar, etc.). Using roles worked without danger of exposing the credentials. – David Nemeskey Jul 17 '14 at 08:17

I think the syntax is

hadoop jar your.jar com.your.main.Class -Dfs.s3n.awsAccessKeyId=<access-id> -Dfs.s3n.awsSecretAccessKey=<secret-key>

Then the path to the common prefix you wish to read should be of the form

s3n://bucket-name/common/prefix/path
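Note that those -D options only take effect if the main class runs them through Hadoop's generic option parsing, e.g. via ToolRunner. A rough sketch of such a driver, with the class, job and path names made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Hypothetical driver: ToolRunner parses the -D options into the
    // Configuration before run() is called, so the s3n credentials are
    // visible to the job.
    public class S3ReadJob extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            Job job = Job.getInstance(getConf(), "s3-read-job");
            job.setJarByClass(S3ReadJob.class);
            FileInputFormat.addInputPath(job, new Path("s3n://bucket-name/common/prefix/path"));
            FileOutputFormat.setOutputPath(job, new Path(args[0]));
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new S3ReadJob(), args));
        }
    }

Keep in mind that keys passed on the command line can be exposed (shell history, process listings), which is one reason the role-based answers above are generally preferable.
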
samthebest