
I have a quick Hadoop Streaming question. If I'm using Python streaming and I have Python packages that my mappers/reducers require but that aren't installed by default, do I need to install those on all the Hadoop machines as well, or is there some sort of serialization that sends them to the remote machines?

James
  • This question shows how to import nltk on each node: http://stackoverflow.com/questions/6811549/how-can-i-include-a-python-package-with-hadoop-streaming-job/6811775#6811775 – viper Nov 04 '13 at 18:08

2 Answers


If they're not installed on your task boxes, you can send them with -file. If you need a package or other directory structure, you can send a zipfile, which will be unpacked for you. Here's a Hadoop 0.17 invocation:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.17.0-streaming.jar -mapper mapper.py -reducer reducer.py -input input/foo -output output -file /tmp/foo.py -file /tmp/lib.zip

However, see this issue for a caveat:

https://issues.apache.org/jira/browse/MAPREDUCE-596
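
For illustration, here is a minimal mapper-side sketch. It assumes the shipped files land in the task's working directory and that lib.zip bundles a hypothetical package mylib with a tokenize() helper; whether or not the archive actually gets unpacked on the node, putting both the working directory and the zip itself on sys.path covers both cases:

#!/usr/bin/env python
# mapper.py -- sketch only; "mylib" and its tokenize() helper are hypothetical
# stand-ins for whatever package you bundled into lib.zip and shipped with -file.
import os
import sys

# If the archive was unpacked, the package directory sits in the cwd; if not,
# Python's zipimport can load modules straight from the zip file itself.
sys.path.insert(0, os.getcwd())
sys.path.insert(0, os.path.join(os.getcwd(), "lib.zip"))

import mylib  # hypothetical package inside lib.zip

for line in sys.stdin:
    for token in mylib.tokenize(line):  # hypothetical helper
        sys.stdout.write("%s\t1\n" % token)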

Karl Anderson
  • 1,798
  • 11
  • 18

If you use Dumbo you can use -libegg to distribute egg files and auto-configure the Python runtime:

https://github.com/klbostee/dumbo/wiki/Short-tutorial#wiki-eggs_and_jars
https://github.com/klbostee/dumbo/wiki/Configuration-files
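
As a rough sketch (following the Dumbo API from the tutorial linked above, with a hypothetical package mypackage shipped as mypackage-1.0.egg), a job that depends on third-party code could look like this, with the egg passed at launch time via -libegg:

# wordcount.py -- sketch of a Dumbo job; "mypackage" and mypackage-1.0.egg are
# hypothetical stand-ins for whatever third-party code the nodes need.
#
# Launched with something like:
#   dumbo start wordcount.py -hadoop $HADOOP_HOME \
#       -input input/foo -output output -libegg mypackage-1.0.egg

import mypackage  # hypothetical package distributed as an egg via -libegg

def mapper(key, value):
    for token in mypackage.tokenize(value):  # hypothetical helper
        yield token, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)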