
I have a quick Hadoop Streaming question. If I'm using Python streaming and I have Python packages that my mappers/reducers require but that aren't installed by default, do I need to install those on all the Hadoop machines as well, or is there some sort of serialization that sends them to the remote machines?

James
  • This question shows how to import nltk on each node: http://stackoverflow.com/questions/6811549/how-can-i-include-a-python-package-with-hadoop-streaming-job/6811775#6811775 – viper Nov 04 '13 at 18:08

2 Answers


If they're not installed on your task boxes, you can send them with -file. If you need a package or other directory structure, you can send a zipfile, which will be unpacked for you. Here's a Hadoop 0.17 invocation:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.17.0-streaming.jar -mapper mapper.py -reducer reducer.py -input input/foo -output output -file /tmp/foo.py -file /tmp/lib.zip

However, see this issue for a caveat:

https://issues.apache.org/jira/browse/MAPREDUCE-596
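
For illustration, here is a minimal mapper-side sketch. It assumes the shipped files land in the task's working directory and that lib.zip bundles a hypothetical package mylib with a tokenize() helper; whether or not the archive actually gets unpacked on the node, putting both the working directory and the zip itself on sys.path covers both cases:

#!/usr/bin/env python
# mapper.py -- sketch only; "mylib" and its tokenize() helper are hypothetical
# stand-ins for whatever package you bundled into lib.zip and shipped with -file.
import os
import sys

# If the archive was unpacked, the package directory sits in the cwd; if not,
# Python's zipimport can load modules straight from the zip file itself.
sys.path.insert(0, os.getcwd())
sys.path.insert(0, os.path.join(os.getcwd(), "lib.zip"))

import mylib  # hypothetical package inside lib.zip

for line in sys.stdin:
    for token in mylib.tokenize(line):  # hypothetical helper
        sys.stdout.write("%s\t1\n" % token)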

Karl Anderson
  • 1,798
  • 11
  • 18

If you use Dumbo you can use -libegg to distribute egg files and auto-configure the Python runtime:

https://github.com/klbostee/dumbo/wiki/Short-tutorial#wiki-eggs_and_jars
https://github.com/klbostee/dumbo/wiki/Configuration-files
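
As a rough sketch (following the Dumbo API from the tutorial linked above, with a hypothetical package mypackage shipped as mypackage-1.0.egg), a job that depends on third-party code could look like this, with the egg passed at launch time via -libegg:

# wordcount.py -- sketch of a Dumbo job; "mypackage" and mypackage-1.0.egg are
# hypothetical stand-ins for whatever third-party code the nodes need.
#
# Launched with something like:
#   dumbo start wordcount.py -hadoop $HADOOP_HOME \
#       -input input/foo -output output -libegg mypackage-1.0.egg

import mypackage  # hypothetical package distributed as an egg via -libegg

def mapper(key, value):
    for token in mypackage.tokenize(value):  # hypothetical helper
        yield token, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)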