
Possible Duplicate:
MultipleOutputFormat in hadoop

How are users of Apache Hadoop 0.20.203 dealing with the lack of support for MultipleOutputs (reducers writing to multiple output files)?

Older versions of Apache Hadoop support MultipleOutputs, but to use them it seems one must use deprecated APIs.

I've also heard that certain Cloudera Distributions of Hadoop support a more recent MultipleOutputs API, as defined at http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html, which is supposed to come out in the 0.21 Apache release of Hadoop.

Cheers!

tdp2110
  • I'm facing the same problem, but I wasn't sure whether MultipleOutputs support is actually missing in 0.20.203. As I understand it, the support is there; there's just a difference in argument types. For example, in 0.20.x, MultipleOutputs() takes a JobConf as an argument. So, how did you finally get past this problem? – Piyush Kansal Feb 25 '12 at 00:47

1 Answer


First, have you considered trying to backport MultipleOutputs to the version of Hadoop that you're running? I don't know how hard this would be, but I've had some success backporting things like bug fixes in CombineFileInputFormat.

Without MultipleOutputs, it's possible to achieve something similar by writing a custom Partitioner to place keys into a pre-determined set of buckets, and forcing the number of reduce tasks to be equal to the number of buckets.

I'll try to make this more concrete with an example similar to what's in the JavaDocs you linked for MultipleOutputs. In that example, the reducer wrote to 2 pre-determined named outputs: "text" and "seq". Knowing at job submission time that there are exactly 2 outputs, we submit the job with the number of reduce tasks set to 2. For each key-value pair that the mapper receives, it must write 2 output key-value pairs: one with "text" as part of the key and one with "seq" as part of the key. Then, in the custom partitioner, we can do something like:

public int getPartition(Text key, Text value, int numPartitions) {
    if (key.toString().equals("text"))
        return 0;
    else if (key.toString().equals("seq"))
        return 1;
    return 0; // fall-through; only "text" and "seq" keys are expected
}

Then, assuming a no-op IdentityReducer, we know that part-r-00000 will contain all of the "text" records and part-r-00001 will contain all of the "seq" records. It's vitally important that the job runs with exactly 2 reduce tasks. (If there were only one reduce task, the "text" and "seq" records would all be combined into part-r-00000.)
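To see the routing end to end, here's a standalone sketch that simulates how the partitioner above splits records across the two reduce tasks. This is plain Java with no Hadoop dependencies; the class and helper names are mine, for illustration only:

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionDemo {
    // Mirrors the custom partitioner's logic: "text" keys go to
    // reduce task 0 (part-r-00000), "seq" keys to task 1 (part-r-00001).
    static int getPartition(String key, int numPartitions) {
        if (key.equals("text"))
            return 0;
        else if (key.equals("seq"))
            return 1;
        // Unexpected keys would normally be hash-partitioned.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2; // must equal the number of named outputs
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++)
            partitions.add(new ArrayList<>());

        // Tagged (key, value) records as the mapper would emit them.
        String[][] records = {{"text", "a"}, {"seq", "b"}, {"text", "c"}};
        for (String[] rec : records)
            partitions.get(getPartition(rec[0], numReduceTasks)).add(rec[1]);

        System.out.println(partitions.get(0)); // [a, c] -> part-r-00000
        System.out.println(partitions.get(1)); // [b]    -> part-r-00001
    }
}
```

Each simulated partition corresponds to one reduce task's output file, which is why the reduce task count has to match the number of named outputs.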

Notice that I've skipped the third named output from the MultipleOutputs example. That's much harder to solve, because the name must be determined at runtime. This solution only works if you know a pre-determined set of names at job submission time.

Fair warning: this entire solution is very brittle. If the number of names changes, then you must change the number of reduce tasks to match. Depending on the nature of your problem, it may be possible to detect all possible keys prior to job submission and adjust the number of reduce tasks accordingly. It also takes more effort to scale the solution out so that each output is handled by more than one reduce task. All things considered, this solution can be difficult to maintain, but it's the only way I know to solve it without MultipleOutputs.
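If you do detect the output names dynamically before submission, the bucket assignment could look something like this (a hypothetical helper I'm sketching here, not part of any Hadoop API; the job would then be submitted with the bucket count as its number of reduce tasks):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class BucketAssigner {
    // Given the set of output names discovered before job submission,
    // assign each name a stable reduce-task index. The job would then
    // be configured with job.setNumReduceTasks(buckets.size()), and the
    // partitioner would look keys up in this map instead of hard-coding.
    static Map<String, Integer> assignBuckets(Set<String> names) {
        Map<String, Integer> buckets = new LinkedHashMap<>();
        int index = 0;
        for (String name : new TreeSet<>(names)) // sort for a stable order
            buckets.put(name, index++);
        return buckets;
    }
}
```

This at least removes the hard-coded reduce-task count, though the detection pass itself (e.g., a prior counting job) is extra work.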

Chris Nauroth