
I want to produce the output of my map function, filtering the data by dates.

In local tests, I simply call the application, passing the dates as parameters:

cat access_log | ./mapper.py 20/12/2014 31/12/2014 | ./reducer.py

Then the parameters are read in the map function:

#!/usr/bin/python
import sys

date1 = sys.argv[1]
date2 = sys.argv[2]
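
For context, a simplified sketch of the kind of filtering I mean (the Apache-style log layout and the dd/mm/yyyy argument format here are just an example, not necessarily the real format):

#!/usr/bin/python
# mapper.py - sketch only. Assumes Apache common-log lines such as:
#   127.0.0.1 - - [20/Dec/2014:10:00:00 +0000] "GET / HTTP/1.1" 200 512
import sys
from datetime import datetime

def arg_date(s):
    # command-line dates arrive as dd/mm/yyyy, e.g. 20/12/2014
    return datetime.strptime(s, "%d/%m/%Y")

def line_date(line):
    # field 4 holds "[20/Dec/2014:10:00:00"; keep only the date part
    stamp = line.split()[3].lstrip("[").split(":")[0]
    return datetime.strptime(stamp, "%d/%b/%Y")

date1 = arg_date(sys.argv[1])
date2 = arg_date(sys.argv[2])

for line in sys.stdin:
    try:
        d = line_date(line)
    except (IndexError, ValueError):
        continue  # skip lines that do not parse
    if date1 <= d <= date2:
        # emit tab-separated key/value pairs, here one hit per matching day
        sys.stdout.write(d.strftime("%d/%m/%Y") + "\t1\n")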

The question is: How do I pass the date parameters to the mapper when running on Amazon EMR?

I am a beginner in MapReduce. I will appreciate any help.

FelipeGTX

1 Answer


First of all, when you run a local test (and you should, as often as possible), the correct format, in order to reproduce how map-reduce works, is:

cat access_log | ./mapper.py 20/12/2014 31/12/2014 | sort | ./reducer.py | sort

That is the way the Hadoop framework works.
If you are working with a big file, you should do it in steps to verify the result of each stage, meaning:

cat access_log | ./mapper.py 20/12/2014 31/12/2014 > map_result.txt
cat map_result.txt | sort > map_result_sorted.txt
cat map_result_sorted.txt | ./reducer.py > reduce_result.txt
cat reduce_result.txt | sort > map_reduce_result.txt
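
Your reducer is not shown in the question, but assuming the mapper emits tab-separated key/value pairs (for example date<TAB>1), a minimal reducer sketch that sums hits per key, relying on the sorted input grouping equal keys together, could look like this:

#!/usr/bin/python
# reducer.py - sketch only: sums the counts per key; works because the
# framework (or the local sort) delivers equal keys consecutively.
import sys

current_key = None
count = 0

for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current_key:
        if current_key is not None:
            sys.stdout.write("%s\t%d\n" % (current_key, count))
        current_key = key
        count = 0
    count += int(value or 0)

if current_key is not None:
    sys.stdout.write("%s\t%d\n" % (current_key, count))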

In regard to your main question:
It's the same thing.

If you are going to use the Amazon web console to create your cluster, in the Add Step window you just write the following:

Name: learning amazon emr
Mapper: (here they ask for an S3 path to your mapper; we will ignore that and just write our script name and parameters, no path prefix) mapper.py 20/12/2014 31/12/2014
Reducer: (the same as for the mapper) reducer.py (you can add params here too)
Input location: ...
Output location: ... (just remember to use a new output location every time, or your task will fail)
Arguments: -files s3://cod/mapper.py,s3://cod/reducer.py (use your own file paths here; even if you add only one file, use the -files argument)
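
If you would rather submit the step from code instead of the console, a rough boto3 sketch of the same streaming step might look like this (the cluster id, region, and s3:// paths are placeholders, and command-runner.jar assumes a recent EMR release):

# Sketch only: add the same hadoop-streaming step to a running cluster.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # your cluster id
    Steps=[{
        "Name": "learning amazon emr",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar is available on EMR 4.x and later
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://cod/mapper.py,s3://cod/reducer.py",
                "-mapper", "mapper.py 20/12/2014 31/12/2014",
                "-reducer", "reducer.py",
                "-input", "s3://cod/input/access_log",
                "-output", "s3://cod/output/run1/",  # must not exist yet
            ],
        },
    }],
)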

That's it.

If you are going into the whole argument thing, I suggest you look at this guy's blog on how to pass arguments in order to use only a single map/reduce file.

Hope it helped

ohad edelstain
  • Thank you for the answer. I tried both ways that you mentioned: 1) passing parameters in the Mapper and Reducer directly, 2) specifying the files in the arguments. It turns out the result, in both cases, was that Hadoop could not find the executable 'mappername.py 20/12/2014 31/12/2014'. It might work with Java, but I think it is not possible with streaming jobs (Python). Thank you for your time. – FelipeGTX May 22 '15 at 11:23
  • I didn't write two ways :), just the one way, and I am working with Python and it works for me. Again, in the mapper/reducer part you write your command with the parameters you want, and in the arguments part you write the path to the file. I promise you it works; I can also promise you that, like you, I spent a few good hours of trial and error before I got it to work. – ohad edelstain May 22 '15 at 14:02