I have a quick Hadoop Streaming question. If I'm using Python streaming and I have Python packages that my mappers/reducers require but aren't installed by default do I need to install those on all the Hadoop machines as well or is there some sort of serialization that sends them to the remote machines?
Asked
Active
Viewed 1,095 times
2 Answers
2
If they're not installed on your task boxes, you can send them with -file. If you need a package or other directory structure, you can send a zipfile, which will be unpacked for you. Here's a Haddop 0.17 invocation:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.17.0-streaming.jar -mapper mapper.py -reducer reducer.py -input input/foo -output output -file /tmp/foo.py -file /tmp/lib.zip
However, see this issue for a caveat:

Karl Anderson
- 1,798
- 11
- 18
1
If you use Dumbo you can use -libegg to distribute egg files and auto-configure the Python runtime:
https://github.com/klbostee/dumbo/wiki/Short-tutorial#wiki-eggs_and_jars https://github.com/klbostee/dumbo/wiki/Configuration-files

Joe Futrelle
- 11
- 1