
Instead of copying the data over to HDFS first, is it possible to process the objects in an S3 bucket directly from EMR?

I've tried this, but I keep hitting one of two problems: either I get security errors about missing credentials (even after adding them to the configuration) when I simply do `new Path("s3n://...")`, or, when I use the AWS SDK to access the bucket instead, running the jar fails because the AWS SDK classes are missing.
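For concreteness, here is a minimal sketch of the kind of direct S3 access being attempted (the bucket name and credentials are placeholders; `fs.s3n.awsAccessKeyId` and `fs.s3n.awsSecretAccessKey` are the standard Hadoop credential properties for the `s3n://` scheme):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListS3Objects {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder credentials; these must be set (here or in
        // core-site.xml) before the s3n filesystem is first resolved.
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");

        Path bucket = new Path("s3n://mybucket/input/"); // hypothetical bucket
        // Resolve the filesystem from the path's scheme. FileSystem.get(conf)
        // would return the cluster's *default* filesystem (HDFS on EMR),
        // which is a common source of credential errors with s3n:// paths.
        FileSystem fs = bucket.getFileSystem(conf);
        for (FileStatus status : fs.listStatus(bucket)) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
    }
}
```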

Julian
  • Are you using EMR? If yes, is the S3 bucket under the same AWS account? If so, you shouldn't need to provide any security credentials. An example command looks like: `ruby elastic-mapreduce --jobflow <jobflow-id> --jar s3://<bucket>/myJob.jar --arg s3://<input-path> --arg s3://<output-path> --step-name "My Job"` – Amar Aug 14 '13 at 10:00
  • @Amar What if the S3 bucket is not under the same AWS account? How do you specify the security credentials in that case? – Abhishek Jain Aug 23 '13 at 09:51
  • I am not sure whether this would work, but try embedding the credentials in the URI: `s3://<access-key-id>:<secret-access-key>@<bucket>/<path>`, something like `s3://RYWX12N9WCY42XVOL8WH:Xqj1%2FNMvKBhl1jqKlzbYJS66ua0e8z7Kkvptl9bv@mybucket/dest` – Amar Aug 23 '13 at 20:44
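Regarding the credentials-in-the-URI form in the last comment: AWS secret access keys often contain `/` or `+`, so the secret has to be percent-encoded before being embedded in the URI (which is why the example shows `%2F`). A minimal sketch of a helper that does this, with hypothetical credentials:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class S3UriWithCredentials {
    // Percent-encode the secret key so characters like '/' and '+'
    // do not break URI parsing.
    static String s3nUri(String accessKey, String secretKey, String bucketAndPath)
            throws UnsupportedEncodingException {
        return "s3n://" + accessKey + ":" + URLEncoder.encode(secretKey, "UTF-8")
                + "@" + bucketAndPath;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Hypothetical credentials, for illustration only.
        System.out.println(s3nUri("AKIAEXAMPLEKEY",
                "Xqj1/NMvKBhl1jqKlzbYJS66ua0e8z7K", "mybucket/dest"));
    }
}
```

Note that embedding secrets in URIs leaks them into logs and job history; where possible, prefer the `fs.s3n.awsAccessKeyId` and `fs.s3n.awsSecretAccessKey` configuration properties instead.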

1 Answer


You can pass the S3 input and output locations in the step's Arguments section.

When adding the step, select Custom JAR:

JAR location: s3://inbsightshadoop/jar/loganalysis.jar
Main class: None
Arguments: s3://inbsightshadoop/insights-input s3://inbsightshadoop/insights-output
Action on failure: Terminate cluster
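When Main class is set to None, Hadoop uses the main class declared in the JAR's manifest. A minimal sketch of such a driver (the class name and job details are hypothetical; this map-only job simply copies records from the first argument's location to the second's):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogAnalysisDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log analysis");
        job.setJarByClass(LogAnalysisDriver.class);

        // args[0] and args[1] arrive exactly as typed in the step's
        // Arguments box, so s3:// URIs can be used directly; on EMR the
        // cluster's role supplies the credentials for them.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Default (identity) mapper and zero reducers: records are read
        // from S3 and written straight back out, with no copy to HDFS.
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```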

user3652630