
I've seen a technique (on Stack Overflow) for executing a Hadoop Streaming job that uses zip files to store the referenced Python modules.
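Roughly, the technique looks like this on my end (a minimal sketch; modules.zip and mymodule are placeholder names, not my actual files, and the zip is assumed to land in the task's working directory):

#!/usr/bin/env python
# Sketch of the zip-module technique: put the shipped zip on sys.path and let
# Python's zipimport machinery resolve imports from inside it.
import sys

sys.path.insert(0, 'modules.zip')   # placeholder name for the shipped archive
import mymodule                     # placeholder module resolved out of modules.zip

for line in sys.stdin:
    # placeholder call into the zipped module
    key, value = mymodule.process(line.strip())
    sys.stdout.write('%s\t%s\n' % (key, value))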

I'm hitting errors during the map phase of my job's execution, and I'm fairly certain they're related to loading the zipped module.

To debug the scripts, I ran my data set through the mapper and reducer locally via sys.stdin/sys.stdout using command-line pipes, something like this:

head inputdatafile.txt | ./mapper.py | sort -k1,1 | ./reducer.py

The results look great.

When I run this through Hadoop, though, I start hitting problems: the mapper and reducer fail and the entire Hadoop job fails completely.

My zipped module file contains *.pyc files. Is that going to cause problems?

Also, where can I find the errors generated during the map/reduce phases when using Hadoop Streaming?

I've used the -file command-line argument to tell Hadoop where the zipped module is located and where my mapper and reducer scripts are located.

I'm not using any unusual configuration options to change the number of mappers and reducers used in the job.

Any help would be greatly appreciated. Thanks!

ct_
  • ...why are you so sure it is the zip'd module? – Lorenz Lo Sauer Sep 23 '12 at 13:55
  • It's my first time trying to do this with the zipped module. As a test, I wrote a really simple mapper and reducer that do nothing but read sys.stdin and write to sys.stdout, and they worked just fine. The minute I add in my script that uses the zipped module, the mapper fails. I figured that if it works in a command-line test, and Hadoop streams fine with my minimalist test, then it has to be how I'm working with the zipped module. – ct_ Sep 23 '12 at 13:57
  • It turns out the problem is with the nltk.tokenize module. Any help working with that component in a Hadoop Streaming context would be greatly appreciated! – ct_ Sep 23 '12 at 22:24

1 Answer


After reviewing sent_tokenize's source code, it looks like both nltk.sent_tokenize and nltk.tokenize.sent_tokenize rely on a pickle file (the one used for Punkt tokenization) to operate.

Since this is Hadoop Streaming, you'd have to figure out where and how to place that pickle file into the zipped code module that is added to the Hadoop job's jar.
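If you do want to keep using sent_tokenize, one possible approach (just a sketch, assuming you ship an nltk_data directory containing tokenizers/punkt/english.pickle alongside the job, e.g. via -file or an archive) is to point nltk.data.path at that local copy before loading the tokenizer:

# Sketch only: assumes an "nltk_data" directory containing
# tokenizers/punkt/english.pickle was shipped with the job and now sits in
# the task's working directory.
import os
import nltk.data

# Make NLTK look in the local, shipped copy first instead of ~/nltk_data
nltk.data.path.insert(0, os.path.join(os.getcwd(), 'nltk_data'))

# Load the Punkt sentence tokenizer from that pickle
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
print(sent_tokenizer.tokenize("First sentence. Second sentence."))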

Bottom line: I recommend using the RegexpTokenizer class for both sentence- and word-level tokenization.
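For example, something along these lines (a rough sketch; the regex patterns are illustrative, not a tuned replacement for Punkt):

# Pickle-free tokenization with RegexpTokenizer; the patterns here are
# deliberately simple examples.
from nltk.tokenize import RegexpTokenizer

# Crude sentence splitter: treat runs of ., !, ? plus whitespace as gaps
sent_tokenizer = RegexpTokenizer(r'[.!?]+\s*', gaps=True)
# Simple word tokenizer: runs of word characters
word_tokenizer = RegexpTokenizer(r'\w+')

text = "Hadoop streaming failed. The zipped module was fine!"
for sentence in sent_tokenizer.tokenize(text):
    print(word_tokenizer.tokenize(sentence))

Since RegexpTokenizer is driven entirely by the regex you pass in, it has no external data-file dependency to ship with the job.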

ct_