How to run a MapReduce job on Amazon's Elastic MapReduce (EMR) cluster from Windows?

Question

I'm trying to learn how to run a Java Map/Reduce (M/R) job on Amazon's EMR. The documentation I am following is here: http://aws.amazon.com/articles/3938. I am on a Windows 7 computer.

When I try to run this command, I am shown the help information.

./elasticmapreduce-client.rb RunJobFlow streaming_jobflow.json

Of course, since I am on a Windows machine, I actually type in the command below. I am not sure why, but this particular command had no Windows version (all the other commands were shown in pairs, one for *nix and one for Windows).

 ruby elastic-mapreduce RunJobFlow my_job.json

My question is: how do we submit/run a job from Windows to Amazon's EMR using the command-line interface? I've tried searching online, but I get taken to wild places. Any help is appreciated.

thanks.

score 1 · Answer 1 · answered Jun 19 '12 at 21:18

To run a streaming job on EMR, you first need to create a cluster with a command like:

ruby elastic-mapreduce --create --alive --plain-output --master-instance-type m1.xlarge --slave-instance-type m1.xlarge --num-instances 6 --name "Some Job Cluster" --bootstrap-action s3://<path-to-a-bootstrap-script>

This returns a job flow ID, which looks something like: j-ABCD7EF763
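If you script these calls (for example from a Ruby script on Windows), you will want to capture that ID and reuse it with -j. Here is a minimal sketch, assuming the ID always has the shape shown above (j- followed by upper-case letters and digits). That shape is inferred from the example ID, not from a published spec, and parse_jobflow_id is a made-up helper name.

```ruby
# Sketch: pull the job flow ID out of whatever text the CLI printed
# after `--create`. The pattern below is an assumption based on the
# example ID j-ABCD7EF763 in this answer.
JOBFLOW_ID_PATTERN = /\bj-[A-Z0-9]+\b/

# Returns the first job flow ID found in the text, or nil if none is found.
def parse_jobflow_id(cli_output)
  match = cli_output.to_s.match(JOBFLOW_ID_PATTERN)
  match && match[0]
end

# Hypothetical usage (not executed here): capture the CLI output with
# backticks, then pass the ID to later `-j` calls.
#   output     = `ruby elastic-mapreduce --create --alive --plain-output`
#   jobflow_id = parse_jobflow_id(output)
```

Returning nil instead of raising lets the calling script decide how to report a failed cluster creation.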

Now you can submit your job step with the following command:

ruby elastic-mapreduce -j j-ABCD7EF763 --stream --step-name "my step name" --mapper s3://<some-path>/mapper-script.rb --reducer s3://<some-path>/reducer-script.rb --input s3://<input-path> --output s3://<output-path>
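The command above does not show what mapper-script.rb and reducer-script.rb contain. As a rough sketch, assuming a word-count job and the usual Hadoop streaming convention of tab-separated key/value lines on stdin and stdout, the two scripts could look something like this. The method names map_stream and reduce_stream are made up for illustration. In practice each method would sit in its own script, ending with a call such as map_stream($stdin, $stdout).

```ruby
# Mapper side: emit "word<TAB>1" for every whitespace-separated word.
def map_stream(input, output)
  input.each_line do |line|
    line.split.each { |word| output.puts "#{word}\t1" }
  end
end

# Reducer side: Hadoop streaming hands the reducer the mapper output
# sorted by key, so equal words arrive on adjacent lines. Sum the counts
# for each run of identical words and emit "word<TAB>total".
def reduce_stream(input, output)
  current_word = nil
  total = 0
  input.each_line do |line|
    word, count = line.chomp.split("\t", 2)
    next if word.nil? || word.empty? # skip blank lines
    if word != current_word
      # A new key starts: flush the total for the previous key, if any.
      output.puts "#{current_word}\t#{total}" unless current_word.nil?
      current_word = word
      total = 0
    end
    total += count.to_i
  end
  # Flush the last key.
  output.puts "#{current_word}\t#{total}" unless current_word.nil?
end
```

Taking the input and output streams as parameters makes it possible to try the logic locally on a small file before paying for a cluster.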

You can also run a job directly instead of running a streaming job, in which case the cluster will terminate itself when the job ends.

score 1 · Answer 2 · edited Jun 22 '12 at 00:51

Try using the --json option.

e.g. ./elastic-mapreduce --create --name Multisteps --json wordcount_jobflow.json

You will need to trim your JSON file down to only the Steps (remove everything outside the []). There is a thread discussing that: https://forums.aws.amazon.com/thread.jspa?threadID=35093
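Trimming by hand is easy to get wrong, so here is a small sketch of doing it with Ruby's standard json library. It assumes the original file is a JSON object with a top-level "Steps" key holding the array, as in a RunJobFlow-style request. extract_steps and the file names in the usage comment are made up for illustration.

```ruby
require 'json'

# Return only the Steps array from a RunJobFlow-style JSON document.
# If the document is already a bare array, it is returned unchanged.
def extract_steps(json_text)
  data = JSON.parse(json_text)
  return data if data.is_a?(Array)

  steps = data['Steps']
  raise ArgumentError, 'no "Steps" array found' unless steps.is_a?(Array)

  steps
end

# Hypothetical usage (file names are placeholders):
#   trimmed = JSON.pretty_generate(extract_steps(File.read('my_job.json')))
#   File.write('my_job_steps.json', trimmed)
```

The trimmed file is then what you would pass to --json.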

edited Jun 22 '12 at 00:51 by Jon Lin · answered Jun 21 '12 at 23:21 by Tony Fu

score 1 · Answer 3 · answered Mar 14 '12 at 23:43

Hmmm. I'm not sure how old the example with RunJobFlow is... I'd personally ignore it.

Are you able to run this?

localhost$ elastic-mapreduce --describe

Once you can, you should play directly on a cluster to shake out the exact steps you need. It's worth doing this so you don't have to start and stop a cluster a bazillion times.

localhost$ elastic-mapreduce --create --alive --num-instances 1
localhost$ elastic-mapreduce -j j-YOUR_ID_HERE --ssh

cluster$ hadoop jar my.jar -D some=1 -D args=1 blah blah
cluster$ hadoop jar some_other_jar.jar -D foo -D bar
cluster$ ^D

localhost$ elastic-mapreduce -j j-YOUR_ID_HERE --terminate

Then, when you're happy with the steps and need them to run headless (say, from cron), you can have EMR orchestrate the steps (including the cluster terminating itself at the end):

localhost$ elastic-mapreduce --create --num-instances 1
localhost$ elastic-mapreduce --jar my_jar.jar --args "-D,some=1,-D,args=1,blah,blah"
localhost$ elastic-mapreduce --jar some_other_jar.jar --args "-D,foo,-D,bar"
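Note how the space-separated arguments from the interactive hadoop jar calls become a single comma-separated string for --args. If you generate these commands from a script, a tiny helper can do that conversion. This is only a sketch: to_emr_args is a made-up name, and rejecting arguments that contain a comma is my assumption about how the comma-separated form would be split.

```ruby
# Join hadoop-jar style arguments into the comma-separated form
# used with --args in the commands above.
def to_emr_args(argv)
  argv.each do |arg|
    # Assumption: a comma inside an argument would be read as a separator,
    # so refuse such arguments rather than silently produce a wrong command.
    raise ArgumentError, "argument contains a comma: #{arg}" if arg.include?(',')
  end
  argv.join(',')
end

# Example: the arguments of the first interactive `hadoop jar` call above.
#   to_emr_args(%w[-D some=1 -D args=1 blah blah])
```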

I'd only explore the --json stuff if you need more complex steps; it's a bit cryptic and hard to get right the first time.
