
I want to output gzip and lzo formats at the same time in one job.

I used MultipleOutputs and added two named outputs like this:

MultipleOutputs.addNamedOutput(job, "LzoOutput", GBKTextOutputFormat.class, Text.class, Text.class);

GBKTextOutputFormat.setOutputCompressorClass(job, LzoCodec.class);

MultipleOutputs.addNamedOutput(job, "GzOutput", TextOutputFormat.class, Text.class, Text.class);

TextOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

(GBKTextOutputFormat is a class I wrote myself; it extends FileOutputFormat.)

They are used in the reducer like this:

multipleOutputs.write("LzoOutput", NullWritable.get(), value, "/user/hadoop/lzo/"+key.toString());

multipleOutputs.write("GzOutput", NullWritable.get(), value, "/user/hadoop/gzip/"+key.toString());

The result is:

I get output files in both paths, but they are all in gzip format.

Can someone help me? Thanks!

==========================================================================

More:

I just looked at the source code of setOutputCompressorClass in FileOutputFormat, which calls conf.setClass("mapred.output.compression.codec", codecClass, CompressionCodec.class);

It seems that mapred.output.compression.codec in the configuration is overwritten every time setOutputCompressorClass is called.

So the actual compression format is whichever one was set last, and we cannot set two different compression formats in the same job? Or is there something else I have overlooked?
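To illustrate the overwrite described above, here is a minimal sketch. It assumes a plain java.util.Map as a stand-in for Hadoop's Configuration, and the class LastWriteWins and its helper method are hypothetical names used only for illustration; only the property key is taken from the source quoted above.

```java
import java.util.HashMap;
import java.util.Map;

public class LastWriteWins {
    // The single key that setOutputCompressorClass writes, per the source quoted above.
    public static final String CODEC_KEY = "mapred.output.compression.codec";

    // Hypothetical stand-in for setOutputCompressorClass:
    // it stores the codec name under one fixed, job-wide key.
    public static void setOutputCompressorClass(Map<String, String> conf, String codecName) {
        conf.put(CODEC_KEY, codecName);
    }

    // Mirrors the order of calls in the job setup from the question.
    public static String effectiveCodec() {
        Map<String, String> conf = new HashMap<>(); // stand-in for the job Configuration
        setOutputCompressorClass(conf, "LzoCodec");  // intended for "LzoOutput"
        setOutputCompressorClass(conf, "GzipCodec"); // intended for "GzOutput"
        return conf.get(CODEC_KEY);                  // only one value can survive
    }

    public static void main(String[] args) {
        System.out.println(effectiveCodec()); // prints "GzipCodec"
    }
}
```

Because both calls write to the same key, the second call replaces the first, which would be consistent with both outputs ending up gzip-compressed.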

thomaslee
  • Have you confirmed that your GBKTextOutputFormat works when used as the only output format type in a reducer that isn't using MultipleOutputs? Also, have you confirmed that in your custom output format class the compression class is set to something other than GzipCodec in the getRecordWriter() method? – Chris Gerken Oct 18 '12 at 22:32
  • I meant the default compression class... – Chris Gerken Oct 18 '12 at 22:41

1 Answer


So maybe as a work-around, try setting the correct outputCompressorClass directly in the configuration

context.getConfiguration().setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);

just before your write call to each of the outputs. It does look like any output format configuration parameters other than key class, value class and output path are not handled well by MultipleOutputs and we may have to write a bit of code to offset that oversight.

Chris Gerken
  • Thanks for your reply! GBKTextOutputFormat actually works well when there is only one output format type. I will try to setOutputCompressorClass before each `multipleOutputs.write(...)`. – thomaslee Oct 19 '12 at 01:59
  • I'm looking forward to hearing how it goes. The MultipleOutputs object you instantiate actually caches a RecordWriter for each output and that RecordWriter is constructed the first time you try to write something to the output. The problem as we've discussed is that different outputs step on each other's configuration settings and then the RecordWriters are not created correctly. In theory :) – Chris Gerken Oct 19 '12 at 02:04
  • The RecordWriter is cached in the MultipleOutputs object, exactly as you said. The context for the RecordWriter is obtained by taskContext = getContext(...);, which returns the existing taskContext directly if it is not null. That is why calling setOutputCompressorClass before MultipleOutputs.write has no effect. So I copied MultipleOutputs into another class and modified getContext() so that it no longer returns early when taskContext != null. Everything seems OK now. What I should do next is make sure taskContext is instantiated only once for each named output, or it will run too slowly. Thank you very much!!! – thomaslee Oct 19 '12 at 03:51
  • I hadn't thought of creating a new MultipleOutputs-like class. Very cool. – Chris Gerken Oct 19 '12 at 03:57
  • @thomaslee: see my answer to this question: http://stackoverflow.com/questions/12981233/hadoop-multipleoutputs-does-not-write-to-multiple-files-when-file-formats-are-cu – Chris Gerken Oct 22 '12 at 20:09
  • Nice! That helps me write output to different directories that are not known in advance. Thank you, @Chris! – thomaslee Oct 23 '12 at 04:19
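The idea discussed in the comments above (one configuration per named output, built only once) can be sketched roughly as follows. This is not the actual MultipleOutputs code: plain java.util.Map objects stand in for Hadoop's Configuration and task context, and PerOutputConfigs, configFor and copiesMade are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class PerOutputConfigs {
    public static final String CODEC_KEY = "mapred.output.compression.codec";

    private final Map<String, String> jobConf;        // shared job-level settings (stand-in)
    private final Map<String, String> codecByOutput;  // named output -> codec name
    private final Map<String, Map<String, String>> cache = new HashMap<>();
    public int copiesMade = 0;                        // counts how often a config was built

    public PerOutputConfigs(Map<String, String> jobConf, Map<String, String> codecByOutput) {
        this.jobConf = jobConf;
        this.codecByOutput = codecByOutput;
    }

    // Returns a private configuration for the named output, building it at most once.
    public Map<String, String> configFor(String namedOutput) {
        Map<String, String> cached = cache.get(namedOutput);
        if (cached != null) {
            return cached;                             // reuse, so repeated writes stay cheap
        }
        Map<String, String> copy = new HashMap<>(jobConf); // never mutate the shared config
        String codec = codecByOutput.get(namedOutput);
        if (codec != null) {
            copy.put(CODEC_KEY, codec);                // per-output override
        }
        copiesMade++;
        cache.put(namedOutput, copy);
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> job = new HashMap<>();
        job.put(CODEC_KEY, "GzipCodec");               // whatever was set last on the job
        Map<String, String> codecs = new HashMap<>();
        codecs.put("LzoOutput", "LzoCodec");
        codecs.put("GzOutput", "GzipCodec");

        PerOutputConfigs outputs = new PerOutputConfigs(job, codecs);
        System.out.println(outputs.configFor("LzoOutput").get(CODEC_KEY));
        System.out.println(outputs.configFor("GzOutput").get(CODEC_KEY));
    }
}
```

In a real MultipleOutputs-like class, the per-output copy would presumably be where the codec is applied before the RecordWriter is created; caching the copy keeps the context from being rebuilt on every write.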