
I am using Hadoop CDH 4.1.2, and my mapper program is almost an echo of its input data. But on the job status page, I see

FILE: Number of bytes written  3,040,552,298,327

which is almost equal to

FILE: Number of bytes read 3,363,917,397,416

for the mappers, even though I have already set

conf.set("mapred.compress.map.output", "true");

It seems the compression is not working for my job. Why is this?
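For reference, here is a minimal sketch of where that setting sits in a driver, assuming the old mapred API that matches these property names; MyEchoJob is a placeholder class name, and the mapper/reducer setup is elided:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// sketch only: placeholder driver class using the old mapred API
public class MyEchoJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MyEchoJob.class);
        conf.set("mapred.compress.map.output", "true");   // compress intermediate map output
        // with no explicit codec, mapred.map.output.compression.codec defaults to
        // org.apache.hadoop.io.compress.DefaultCodec (zlib)
        // ... mapper, reducer, and input/output paths configured as usual ...
        JobClient.runJob(conf);
    }
}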

Shawn

1 Answer


Does your job have a reducer?

If so, check 'Reduce shuffle bytes'. If that is considerably less than 'Map output bytes' (a fifth of it or so), you may assume the map output is compressed. Compression happens after the map output is counted, so the counter may be showing the actual size of the data the map emitted, not the compressed size.

If you still have doubts about whether it is working, submit the job with and without compression and compare 'Reduce shuffle bytes'. As far as map output compression is concerned, 'Reduce shuffle bytes' is all that matters.
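If it helps, here is a rough sketch of reading those two counters from the driver after the job finishes instead of from the web UI. It assumes the newer mapreduce API and its TaskCounter enum; the class and job names are placeholders and the actual job setup is elided:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

// sketch only: compare map output size with shuffle size after the job completes
public class CounterCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "echo");   // "echo" is a placeholder job name
        // ... mapper, reducer, input/output paths, and compression settings as usual ...
        job.waitForCompletion(true);

        Counters counters = job.getCounters();
        long mapOutputBytes     = counters.findCounter(TaskCounter.MAP_OUTPUT_BYTES).getValue();
        long reduceShuffleBytes = counters.findCounter(TaskCounter.REDUCE_SHUFFLE_BYTES).getValue();
        // a shuffle/output ratio well below 1.0 suggests the map output really is compressed
        System.out.printf("map output = %d, shuffle = %d, ratio = %.2f%n",
                mapOutputBytes, reduceShuffleBytes, (double) reduceShuffleBytes / mapOutputBytes);
    }
}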

  • Thanks. Map output bytes = 3,219,090,158,272 and Reduce shuffle bytes = 1,514,030,378,633. Does that mean the default compression algorithm is not well suited to my data (pure text)? – Shawn Sep 16 '13 at 09:43
  • Looks like it. I never used the default codec. Can you set conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec"); and check the numbers again? You may also want to try LZO, if it is available in your distro. – Eswara Reddy Adapa Sep 16 '13 at 16:11
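For completeness, the codec change suggested in that comment would look roughly like this in the driver (Snappy ships with CDH4 but needs the native library on the task nodes; the LZO codec class typically comes from the separate hadoop-lzo package):

conf.set("mapred.compress.map.output", "true");
conf.set("mapred.map.output.compression.codec",
        "org.apache.hadoop.io.compress.SnappyCodec");
// or, if LZO is installed in your distro:
// conf.set("mapred.map.output.compression.codec", "com.hadoop.compression.lzo.LzoCodec");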