I want to use multiple outputs in a Hadoop job on Elastic MapReduce, so I set up MultipleOutputs in the main() method like so:
MultipleOutputs.addNamedOutput(hadoopJob, "One",
        TextOutputFormat.class, NullWritable.class, Text.class);
MultipleOutputs.addNamedOutput(hadoopJob, "Two",
        TextOutputFormat.class, NullWritable.class, Text.class);
I want "One" to contain the mapper's output and "Two" to contain the reducer's output.
In the setup method of both the mapper and the reducer, I call:
outputWriters = new MultipleOutputs(context);
In the mapper, I call:
outputWriters.write("One", nothing, sampleOutput, "One");
In the reducer, I call:
outputWriters.write("Two", nothing, new Text(thing.getStuff()), "Two");
Finally, in the cleanup method of both the mapper and the reducer, I call:
outputWriters.close();
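To make the arrangement concrete, here is the mapper side of what I described, consolidated into one sketch. The class name SampleMapper, the input key/value types, and the field names are placeholders of my own; this assumes the new org.apache.hadoop.mapreduce API (the reducer is structured the same way, writing to "Two"):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Placeholder mapper illustrating the setup/map/cleanup sequence described above.
public class SampleMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> outputWriters;

    @Override
    protected void setup(Context context) {
        // One MultipleOutputs instance per task attempt.
        outputWriters = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Write map-side records to the named output "One";
        // the last argument is the base path for the output files.
        outputWriters.write("One", NullWritable.get(), value, "One");
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Flush and close the underlying record writers.
        outputWriters.close();
    }
}
```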
When I do this, I get a "file already exists" exception from the reducer: it tries to recreate the output files that the mapper already created.
I can avoid the exception by removing outputWriters.close() from the mapper's cleanup method, but that introduces another problem: none of the mapper output gets written.
What's the proper way to use MultipleOutputs with one named output written from the mapper and another from the reducer? The JavaDocs don't cover this situation, and I haven't found anything useful on Stack Overflow.
Update: this appears to run fine locally. However, when I run it on Elastic MapReduce with S3 output, I hit the "file already exists" exception again. Any ideas for workarounds?