
I am using Hadoop CDH 4.1.2, and my mapper program is almost an echo of its input data. But on the job status page, I see

FILE: Number of bytes written  3,040,552,298,327

which is almost equal to

FILE: Number of bytes read 3,363,917,397,416

for the mappers, even though I have already set

conf.set("mapred.compress.map.output", "true");

It seems the compression is not working for my job. Why is this?
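For reference, here is a minimal sketch of where that setting sits in a driver, assuming the old mapred API that matches these property names; MyEchoJob is a placeholder class name, and the mapper/reducer setup is elided:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// sketch only: placeholder driver class using the old mapred API
public class MyEchoJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MyEchoJob.class);
        conf.set("mapred.compress.map.output", "true");   // compress intermediate map output
        // with no explicit codec, mapred.map.output.compression.codec defaults to
        // org.apache.hadoop.io.compress.DefaultCodec (zlib)
        // ... mapper, reducer, and input/output paths configured as usual ...
        JobClient.runJob(conf);
    }
}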

Shawn

1 Answer


Does your job have a reducer?

If so, check 'Reduce shuffle bytes'. If that is considerably less than 'Map output bytes' (a fifth of it or so), you may assume the map output is compressed. Compression happens after the map output is counted, so the counter may be showing the actual size of the data the map emitted, not the compressed size.

If you still have doubts about whether it is working, submit the job with and without compression and compare 'Reduce shuffle bytes'. As far as map output compression is concerned, 'Reduce shuffle bytes' is all that matters.
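If it helps, here is a rough sketch of reading those two counters from the driver after the job finishes instead of from the web UI. It assumes the newer mapreduce API and its TaskCounter enum; the class and job names are placeholders and the actual job setup is elided:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

// sketch only: compare map output size with shuffle size after the job completes
public class CounterCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "echo");   // "echo" is a placeholder job name
        // ... mapper, reducer, input/output paths, and compression settings as usual ...
        job.waitForCompletion(true);

        Counters counters = job.getCounters();
        long mapOutputBytes     = counters.findCounter(TaskCounter.MAP_OUTPUT_BYTES).getValue();
        long reduceShuffleBytes = counters.findCounter(TaskCounter.REDUCE_SHUFFLE_BYTES).getValue();
        // a shuffle/output ratio well below 1.0 suggests the map output really is compressed
        System.out.printf("map output = %d, shuffle = %d, ratio = %.2f%n",
                mapOutputBytes, reduceShuffleBytes, (double) reduceShuffleBytes / mapOutputBytes);
    }
}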

  • Thanks. Map output bytes = 3,219,090,158,272 and Reduce shuffle bytes = 1,514,030,378,633. Does that mean the default compression algorithm is not well suited to my data (pure text)? – Shawn Sep 16 '13 at 09:43
  • Looks like it. I never used the default codec. Can you set conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec"); and check the numbers again? You may also want to try LZO, if it is available in your distro. – Eswara Reddy Adapa Sep 16 '13 at 16:11
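For completeness, the codec change suggested in that comment would look roughly like this in the driver (Snappy ships with CDH4 but needs the native library on the task nodes; the LZO codec class typically comes from the separate hadoop-lzo package):

conf.set("mapred.compress.map.output", "true");
conf.set("mapred.map.output.compression.codec",
        "org.apache.hadoop.io.compress.SnappyCodec");
// or, if LZO is installed in your distro:
// conf.set("mapred.map.output.compression.codec", "com.hadoop.compression.lzo.LzoCodec");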