I want to use multiple outputs in a Hadoop job in Elastic MapReduce. So, I set up MultipleOutputs in the main() method like so:

MultipleOutputs.addNamedOutput(hadoopJob, "One",
    TextOutputFormat.class, NullWritable.class, Text.class);

MultipleOutputs.addNamedOutput(hadoopJob, "Two",
    TextOutputFormat.class, NullWritable.class, Text.class);

I want "One" to contain output from the mapper, while "Two" contains output from the reducer.

In the setup method for both the mapper and reducer, I call:

outputWriters = new MultipleOutputs<NullWritable, Text>(context);

In the mapper, I call:

outputWriters.write("One", nothing, sampleOutput, "One");

In the reducer, I call:

outputWriters.write("Two", nothing, new Text(thing.getStuff()), "Two");

Finally, in the cleanup method for both the mapper and reducer, I call:

outputWriters.close();
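Put together, here is roughly the skeleton I'm running. Class names, field names, and the map/reduce bodies are simplified stand-ins, not my exact code:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MyJob {

    public static class MyMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        private MultipleOutputs<NullWritable, Text> outputWriters;
        private final NullWritable nothing = NullWritable.get();

        @Override
        protected void setup(Context context) {
            outputWriters = new MultipleOutputs<NullWritable, Text>(context);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Text sampleOutput = new Text(value); // stand-in for real map logic
            // Mapper-side records go to the "One" named output.
            outputWriters.write("One", nothing, sampleOutput, "One");
            context.write(nothing, value); // normal output feeds the reducer
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // Removing this close() avoids the exception but loses the mapper output.
            outputWriters.close();
        }
    }

    public static class MyReducer
            extends Reducer<NullWritable, Text, NullWritable, Text> {
        private MultipleOutputs<NullWritable, Text> outputWriters;
        private final NullWritable nothing = NullWritable.get();

        @Override
        protected void setup(Context context) {
            outputWriters = new MultipleOutputs<NullWritable, Text>(context);
        }

        @Override
        protected void reduce(NullWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text thing : values) {
                // Reducer-side records go to the "Two" named output.
                outputWriters.write("Two", nothing, new Text(thing), "Two");
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            outputWriters.close();
        }
    }
}
```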

When I do this, I get a "file already exists" exception from the reducer: it tries to recreate the output files that the mapper already created.

I can avoid the exception by removing outputWriters.close() from the mapper's cleanup method, but that introduces another problem: I don't get any of the mapper output.

What's the proper way to use MultipleOutputs with one in the mapper and one in the reducer? The JavaDocs do not mention this situation, and I haven't found anything useful on StackOverflow.

Update: This appears to run fine locally. However, if I run it on Elastic MapReduce with S3 output, I hit the same "file already exists" error. Any ideas on workarounds?

John Chrysostom

0 Answers