How to increase map tasks for MapReduce with bzip2 inputformat

Question

I developed a MapReduce job that works correctly on a plain text file and runs multiple map tasks, but I also need to run it on compressed archives. I chose bzip2. With bzip2 archives, the job runs with only one map task.

Does anyone know how I can increase the number of map tasks?

Hadoop version: Hadoop 0.20.2-cdh3u5

I tried editing mapred-site.xml with different parameters, but it didn't help.

score 0 · Answer 1 · answered Jan 29 '13 at 10:01

How big is your file, and what is your block size? Bzip2 is splittable, so when the file size exceeds the block size and the Bzip2 codec is configured correctly, the file is split automatically and the number of map tasks increases accordingly.
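To make the file-size versus block-size relationship concrete, here is a minimal Python sketch. It is a simplification, not Hadoop's actual split logic: it ignores the minimum split size setting and the slack Hadoop allows on the last split, and the function name `estimate_map_tasks` and the 64 MiB block size are assumptions chosen for illustration.

```python
MIB = 1024 * 1024

def estimate_map_tasks(file_size, block_size, splittable):
    """Rough estimate of map tasks for a single non-empty input file.

    Simplified model (an assumption, not Hadoop's exact algorithm):
    a splittable file yields one split per block, rounded up;
    a non-splittable file is always read by a single map task.
    """
    if file_size <= 0 or block_size <= 0:
        raise ValueError("file_size and block_size must be positive")
    if not splittable:
        return 1
    # Integer ceiling division avoids floating-point rounding.
    return -(-file_size // block_size)

# Hypothetical 1 GiB input with a 64 MiB block size.
print(estimate_map_tasks(1024 * MIB, 64 * MIB, splittable=True))   # 16
print(estimate_map_tasks(1024 * MIB, 64 * MIB, splittable=False))  # 1
```

The second call matches the symptom in the question: if the input format treats the compressed file as non-splittable, the job gets one map task no matter how large the file is.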

The compression properties in mapred-site.xml control your job's (intermediate) output. For compressed input files, the codecs are configured in core-site.xml through the io.compression.codecs property.

Also, if I were you, I would have a look at LZO. By default LZO archives aren't splittable, but they can be indexed so that they become splittable. LZO compresses less than Bzip2 but is much faster. I compressed a 32GB text file using Bzip2: it shrank the file to 1.6GB but took 6.5 hours. The same file with LZO came out at 5GB, but compression took only 30 minutes. The difference in decompression speed is even bigger. Bzip2 also uses a lot more memory.
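For reference, the trade-off in those figures can be worked out with a short Python calculation. It uses only the numbers quoted in this answer, which come from a single run on one file, so the ratios are anecdotal rather than a benchmark. The variable and function names are illustrative.

```python
original_gb = 32.0

# Figures quoted in the answer: compressed size and wall-clock time.
results = {
    "bzip2": {"compressed_gb": 1.6, "hours": 6.5},
    "lzo": {"compressed_gb": 5.0, "hours": 0.5},
}

def summarize(original_gb, compressed_gb, hours):
    """Return the compression ratio and compression throughput."""
    return {
        "ratio": original_gb / compressed_gb,
        "gb_per_hour": original_gb / hours,
    }

summary = {name: summarize(original_gb, **r) for name, r in results.items()}
speedup = results["bzip2"]["hours"] / results["lzo"]["hours"]

for name, s in summary.items():
    print(f"{name}: ratio {s['ratio']:.1f}x, throughput {s['gb_per_hour']:.1f} GB/h")
print(f"LZO finished {speedup:.0f}x faster")
```

On these numbers Bzip2 reaches roughly a 20x compression ratio against 6.4x for LZO, while LZO finishes compression 13 times faster.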

On how to index LZO files, have a look here: https://github.com/twitter/hadoop-lzo

score 0 · Answer 2 · answered Jan 29 '13 at 11:11


According to this thread, Bzip2 splitting support (HADOOP-4012) is not enough on its own: MAPREDUCE-830 is also needed before MapReduce jobs can split Bzip2 files, and MAPREDUCE-830 isn't available in CDH3u5.

Answered by omid
