
I would like to read a file from S3 in my EMR Hadoop job. I am using the Custom JAR option.

I have tried two solutions:

  • org.apache.hadoop.fs.S3FileSystem: throws a NullPointerException.
  • com.amazonaws.services.s3.AmazonS3Client: throws an exception, saying "Access denied".

What I don't understand is this: I am starting the job from the EMR Console, so I would expect to have the necessary permissions. However, the AWS_*_KEY keys are missing from the environment variables (System.getenv()) that are available to the mapper.

I am sure I am doing something wrong, I just don't know what.

David Nemeskey

3 Answers


Probably a little bit late, but: use InstanceProfileCredentialsProvider with AmazonS3Client.
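A minimal sketch of what that might look like, assuming the cluster's EC2 instances have an instance profile attached; the bucket and key names below are placeholders:

    import com.amazonaws.auth.InstanceProfileCredentialsProvider;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.S3Object;
    import com.amazonaws.services.s3.model.S3ObjectInputStream;

    public class S3RoleReadExample {
        public static void main(String[] args) throws Exception {
            // The provider fetches temporary credentials from the EC2 instance
            // metadata service, so no keys appear in code or environment variables.
            AmazonS3Client s3 = new AmazonS3Client(new InstanceProfileCredentialsProvider());
            // Placeholder bucket and key.
            S3Object object = s3.getObject("your-bucket", "path/to/file.txt");
            try (S3ObjectInputStream in = object.getObjectContent()) {
                // read the stream as needed
            }
        }
    }

This only works on instances that actually have an IAM role attached; on a machine without an instance profile the provider has nothing to fetch.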

Ivan Konyshev

I think your EMR cluster needs to have access to S3. You can create an IAM role for your EMR cluster and give it access to S3; check this link: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html
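Once the role is in place, the mapper can read a side file from S3 through the normal Hadoop FileSystem API without embedding any keys. A rough sketch, assuming the role grants read access to the bucket and using a made-up bucket and path:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper that loads a side file from S3 in setup().
    // On EMR the s3n:// scheme is served by the cluster's S3 filesystem,
    // which picks up the credentials of the attached instance-profile role.
    public class SideFileMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            Path sideFile = new Path("s3n://your-bucket/path/to/side-file.txt"); // placeholder
            FileSystem fs = sideFile.getFileSystem(conf);
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(sideFile)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // e.g. load the lines into an in-memory lookup table
                }
            }
        }
    }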

SelimN
  • This was the right way to go. Without roles, the only solution is to write the access keys directly into the code (or a file in the jar, etc.). Using roles worked without danger of exposing the credentials. – David Nemeskey Jul 17 '14 at 08:17

I think the syntax is

hadoop jar your.jar com.your.main.Class -Dfs.s3n.awsAccessKeyId=<access-id> -Dfs.s3n.awsSecretAccessKey=<secret-key>

Then the path to the common prefix you wish to read should be of the form

s3n://bucket-name/common/prefix/path
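Note that those -D options only take effect if the main class runs them through Hadoop's generic option parsing, e.g. via ToolRunner. A rough sketch of such a driver, with the class, job and path names made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Hypothetical driver: ToolRunner parses the -D options into the
    // Configuration before run() is called, so the s3n credentials are
    // visible to the job.
    public class S3ReadJob extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            Job job = Job.getInstance(getConf(), "s3-read-job");
            job.setJarByClass(S3ReadJob.class);
            FileInputFormat.addInputPath(job, new Path("s3n://bucket-name/common/prefix/path"));
            FileOutputFormat.setOutputPath(job, new Path(args[0]));
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new S3ReadJob(), args));
        }
    }

Keep in mind that keys passed on the command line can be exposed (shell history, process listings), which is one reason the role-based answers above are generally preferable.
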
samthebest