Hadoop: How to output different format types in the same job? (part II)

Question

I would like to write compressed and uncompressed files from the same reducer using MultipleOutputs, but it seems to be all or nothing. If I do:

    MultipleOutputs.addNamedOutput(job, "ToGzip", TextOutputFormat.class, NullWritable.class, Text.class);
    TextOutputFormat.setCompressOutput(job, true);
    TextOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

It will compress everything, not only the files that I want. If you look at this very similar question:

Hadoop: How to output different format types in the same job?

You will see that it will fix my problem, but it uses the old interface and the new one does not have:

    context.getConfiguration().setOutputCompressorClass(GzipCodec.class);

What would be the equivalent solution with the new Hadoop API?

score 1 · Accepted Answer · answered Nov 24 '15 at 02:14

Short answer is, I don't think you can right now.

Longer answer/rant. Multiple outputs in Hadoop are a mess. Add in HBase and it gets really messy. The multiple output "feature" that exists today seems more like a fragile hack that is "good enough". Since options are usually job-scoped, there is little granular control over individual outputs.

If you need output-specific compression, then your best bet is to create your own OutputFormat by extending an existing one.
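
As a rough illustration of that approach, here is a minimal sketch of a TextOutputFormat subclass for the new (org.apache.hadoop.mapreduce) API that always gzips its own output, regardless of the job-level compression settings. The class name GzipTextOutputFormat is hypothetical, and the separator key is assumed to be the one used by Hadoop 2.x; treat this as a starting point rather than a tested implementation.

```java
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical name: a TextOutputFormat that always writes gzip,
// ignoring the job-wide setCompressOutput(...) flag.
public class GzipTextOutputFormat<K, V> extends TextOutputFormat<K, V> {

  @Override
  public RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
      throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();

    // Instantiate the codec directly instead of reading it from the job config.
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

    // getDefaultWorkFile appends the extension (".gz") to the task's output name.
    Path file = getDefaultWorkFile(context, codec.getDefaultExtension());
    FileSystem fs = file.getFileSystem(conf);
    FSDataOutputStream fileOut = fs.create(file, false);

    // Assumption: this is the key/value separator property in Hadoop 2.x.
    String separator =
        conf.get("mapreduce.output.textoutputformat.separator", "\t");

    // Wrap the raw stream in the codec's compressing stream.
    return new LineRecordWriter<K, V>(
        new DataOutputStream(codec.createOutputStream(fileOut)), separator);
  }
}
```

With something like this in place, the named output would be registered with the subclass, for example MultipleOutputs.addNamedOutput(job, "ToGzip", GzipTextOutputFormat.class, NullWritable.class, Text.class), and the job-level setCompressOutput / setOutputCompressorClass calls would be left out so the other outputs stay uncompressed.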

Andrew White

Thanks for your answer. I went ahead and trivially extended TextOutputFormat to compress the output. I am accepting your answer because of your good advice. – Javier Nov 24 '15 at 22:28
