
I read *Hadoop in Action* and found that in Java, using the MultipleOutputFormat and MultipleOutputs classes, we can write the reduced data to multiple files, but I am not sure how to achieve the same thing using Python streaming.

For example:

                  / out1/part-0000
mapper -> reducer   
                  \ out2/part-0000

If anyone knows of, has heard of, or has done something similar, please let me know.

daydreamer

1 Answer


Dumbo Feathers, a set of Java classes to use together with Dumbo (a Python library that makes it easy to write efficient Python M/R programs for Hadoop), does this in its output classes.

Basically, in your Python Dumbo M/R job you output a key that is a tuple of two elements: the first element is the name of the directory to output to, and the second is the actual key. The output class you've selected then inspects the tuple to determine which output directory to use, and uses MultipleOutputFormat to write to the different subdirectories; a sketch of such a job is shown below.
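
For illustration, here is a minimal sketch of what such a Dumbo job could look like. The directory names ("out1", "out2") and the word-count splitting rule are placeholders of mine, not part of feathers; you still have to select the appropriate feathers output class when you launch the job.

    import dumbo

    def mapper(key, value):
        # value is one line of text; emit each word with a count of 1
        for word in value.split():
            yield word, 1

    def reducer(key, values):
        total = sum(values)
        # The first element of the key tuple names the output subdirectory,
        # the second element is the real key, as described above.
        # "out1"/"out2" and the "total > 1" rule are arbitrary placeholders.
        subdir = "out1" if total > 1 else "out2"
        yield (subdir, key), total

    if __name__ == "__main__":
        dumbo.run(mapper, reducer)

You would then launch the script with `dumbo start` as usual, pointing the job at the feathers jar and the feathers output class you want to use.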

With Dumbo this is easy thanks to its use of typedbytes as the output format, but I think it should be doable with other output formats as well.

Erik Forsberg
  • How do I use it? Do I just download the jar and pass "-libjar feathers.jar", without affecting any map/reduce job I have written so far? Any sample test code that I could refer to for running this would be helpful – daydreamer Sep 29 '11 at 19:31