How big is your file, and what is your block size? Bzip2 is splittable, so when your file size exceeds your block size and the Bzip2 codec is configured correctly, your file will be split automatically and the number of map tasks will increase accordingly (e.g., a 1 GB bzip2 file with a 128 MB block size yields roughly eight splits, and thus eight map tasks).
The properties in mapred-site.xml are there to compress your job's (intermediate) output. When you use compressed files as input, you should register the codec in core-site.xml using io.compression.codecs.
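
A minimal sketch of both settings, assuming the classic Hadoop 1.x property names (newer releases renamed them to mapreduce.map.output.compress and mapreduce.map.output.compress.codec):

```xml
<!-- core-site.xml: register the codecs Hadoop may use to read compressed input -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>

<!-- mapred-site.xml: compress the intermediate (map) output only -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
```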
Also, if I were you, I would have a look at LZO. By default LZO files aren't splittable, but you can index them so they become splittable. LZO compresses less than Bzip2 but is much faster. For example, I compressed a 32 GB text file with both: Bzip2 reduced it to 1.6 GB but took 6.5 hours, while LZO produced a 5 GB file in 30 minutes. The difference in decompression speed is even bigger, and Bzip2 also uses a lot more memory.
On how to index LZO files, have a look here: https://github.com/twitter/hadoop-lzo
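
Once hadoop-lzo is installed, indexing is a single job. A sketch, assuming the jar lives at /path/to/hadoop-lzo.jar (adjust to your install):

```
# Index a .lzo file so MapReduce can split it; DistributedLzoIndexer
# runs the indexing itself as a MapReduce job (use LzoIndexer for a
# local, single-process run instead).
hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /path/on/hdfs/big_file.lzo
```

This writes a big_file.lzo.index file next to the original, which the LZO input formats then use to find split boundaries.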