I'm able to run a local mapper and reducer written in Ruby against an input file, but I'm unclear about how this behaves on a distributed system.
For the production system, I have HDFS set up across two machines. I know that if I store a large file on HDFS, its blocks will be spread across both machines to allow for parallelization. Do I also need to store the actual mapper and reducer files (my Ruby scripts, in this case) on HDFS as well?
Also, how would I then go about actually running the streaming job so that it runs in parallel on both machines?
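From what I've gathered in the docs, the invocation would look roughly like the following (the streaming jar location and the HDFS input/output paths here are guesses on my part), but I'm not sure whether passing the scripts via `-files` means I don't need to put them on HDFS myself, or whether this alone is enough to get the job running on both machines:

```shell
# My guess at the streaming invocation; jar path and HDFS paths assumed.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files mapper.rb,reducer.rb \
  -mapper "ruby mapper.rb" \
  -reducer "ruby reducer.rb" \
  -input /user/me/input \
  -output /user/me/output
```

Is this the right shape, and does anything else need to be configured for the work to be split across both nodes?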