I've seen a technique (on Stack Overflow) for executing a Hadoop Streaming job that uses zip files to store referenced Python modules.
I'm hitting errors during the map phase of my job, and I'm fairly certain they're related to loading the zipped modules.
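For context, here's a minimal sketch of the pattern I'm following at the top of mapper.py (archive and module names here are illustrative, not my real ones):

    # Sketch of the zip-module-loading pattern (names are illustrative).
    import sys

    # The -file option ships module.zip into the task's working directory,
    # so putting a relative path on sys.path lets zipimport find it.
    sys.path.insert(0, 'module.zip')

    import mymodule  # hypothetical module packaged inside module.zip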
To debug the scripts, I ran my data set through the mapper and reducer locally, using command-line pipes to feed sys.stdin/sys.stdout, something like this:
head inputdatafile.txt | ./mapper.py | sort -k1,1 | ./reducer.py
The results look great.
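Both scripts follow the standard streaming contract: read lines from stdin, write tab-separated key/value pairs to stdout. A stripped-down sketch of the shape of my mapper (the field handling is simplified; my real parsing differs):

    #!/usr/bin/env python
    # Stripped-down shape of mapper.py: read records from stdin and emit
    # tab-separated key/value pairs on stdout (the streaming contract).
    import sys

    def main():
        for line in sys.stdin:
            fields = line.rstrip('\n').split('\t')  # simplified parsing
            if not fields or not fields[0]:
                continue
            # Emit key<TAB>value so `sort -k1,1` and the reducer can
            # group records by key.
            sys.stdout.write('%s\t%s\n' % (fields[0], 1))

    if __name__ == '__main__':
        main()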
When I run this through Hadoop, though, I start hitting problems, i.e. the mapper and reducer fail and the entire Hadoop job fails.
My zipped module archive contains *.pyc files rather than .py sources; could that be causing the failures?
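For what it's worth, an archive built with zipfile.PyZipFile ends up with exactly this layout, since writepy() stores compiled bytecode instead of the sources. A sketch of that kind of build (paths hypothetical), in case it's relevant:

    # One way a module zip ends up holding .pyc files: zipfile.PyZipFile
    # compiles the sources and stores the bytecode (paths hypothetical).
    import zipfile

    zf = zipfile.PyZipFile('module.zip', 'w')
    zf.writepy('mymodule')  # adds mymodule/*.pyc, not the .py sources
    zf.close()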
Also, where can I find the errors generated during the map/reduce phases of a Hadoop Streaming job?
I've used the -file command-line argument to tell Hadoop where the zipped module archive and my mapper and reducer scripts are located.
I'm not using any unusual configuration options to change the number of mappers and reducers used in the job.
Any help would be greatly appreciated, thanks!