Is there any standard way in Hadoop Streaming to handle dependencies, similar to the DistributedCache in Java MapReduce?
Say, for example, I have a Python module that is used in all map tasks. How can I achieve that?
You can use the -file argument to ship the Python module with the job:
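A minimal sketch of the submission command; the streaming jar path varies by Hadoop version, and the input/output paths and file names (mapper.py, mymodule.py) are placeholders:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -input /user/me/input \
        -output /user/me/output \
        -mapper mapper.py \
        -file mapper.py \
        -file mymodule.py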
See http://hadoop.apache.org/docs/r0.18.3/streaming.html
You can specify multiple -file arguments if you have dependency modules and such.
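Files shipped with -file end up in the task's working directory, so the mapper can import the module by name. A sketch, assuming a hypothetical mymodule with a tokenize() function:

    #!/usr/bin/env python
    # mapper.py -- mymodule.py was shipped via -file, so it sits in the
    # task's working directory and can be imported directly.
    import sys
    import mymodule  # hypothetical dependency shipped alongside the mapper

    for line in sys.stdin:
        # tokenize() is a hypothetical function, used here for illustration
        for word in mymodule.tokenize(line):
            print('%s\t%s' % (word, 1))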