
I heard that Hadoop can use multiple mappers to read different parts of a single bzip2 file in parallel, to increase performance. However, I cannot find any related samples after searching. I'd appreciate it if anyone could point me to a related code snippet. Thanks.

BTW: does gzip have the same feature (multiple mappers processing different parts of one gzip file in parallel)?


2 Answers


If you look at http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/30662, you will find that the bzip2 format is indeed splittable, so multiple mappers can work on one file. The patch was submitted at https://issues.apache.org/jira/browse/HADOOP-4012; however, it seems to be available only in Hadoop 0.21.0 and later.

From personal experience, there is nothing special you need to do to use this with bzip2; Hadoop should pick it up automatically, depending on your minimum split size.
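
For example, a minimal map-only driver might look like the sketch below (Hadoop 2.x API; the class name, paths, and the 64 MB cap are illustrative assumptions). Note that nothing bzip2-specific appears in the code; the split size alone controls how many mappers share the file.

    // Sketch only: reads a bzip2-compressed text file with several mappers.
    // Hadoop splits the .bz2 input on its own; the 64 MB cap below just makes
    // sure a large file yields more than one split.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class Bzip2SplitDemo {

      // Identity-style mapper: each mapper instance receives one split of the .bz2 file.
      public static class LineMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(offset, line);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "read-bzip2-in-parallel");
        job.setJarByClass(Bzip2SplitDemo.class);
        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0);                        // map-only, just to show the splits
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Cap splits at 64 MB so a large .bz2 file is processed by several mappers.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path("/data/in/big.bz2"));      // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/data/out/bz2-demo"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }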

bzip2 compresses data in blocks, so it is possible to decompress it block by block and send each block to a separate mapper. gzip, however, has no such block structure, so a gzip file cannot be split across different mappers.
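
This distinction is visible in Hadoop's own compression API: BZip2Codec implements the SplittableCompressionCodec interface (added with HADOOP-4012), while GzipCodec does not. A small check along these lines, with made-up file names, should report true for .bz2 and false for .gz:

    // Sketch: ask Hadoop which of the two codecs it considers splittable.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplittableCheck {
      public static void main(String[] args) {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        for (String name : new String[] {"big.bz2", "big.gz"}) {   // file names are made up
          CompressionCodec codec = factory.getCodec(new Path(name));
          // BZip2Codec implements SplittableCompressionCodec; GzipCodec does not.
          System.out.println(name + " splittable: " + (codec instanceof SplittableCompressionCodec));
        }
      }
    }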

  • Thanks Varun, "gzip, however, has no such block structure, so a gzip file cannot be split across different mappers" -- is there any Hadoop documentation that claims that? – Lin Ma Dec 26 '12 at 15:24
  • As I stated, it is not possible to decompress gzip in parallel, unless you have already decompressed that same file once serially and built a map of entry points _or_ the gzip file has been specially prepared for parallel decompression, which requires custom software for that purpose. – Mark Adler Dec 26 '12 at 18:27
  • @Varun Shingal does one have to add bzip2 to Hadoop in order to use it, or does it come shipped with Hadoop by default? thanx! – theexplorer Feb 05 '15 at 09:42
  • @theexplorer Hadoop has supported this, by default I think, since release 0.21.0 in 2010. It's great. https://issues.apache.org/jira/browse/HADOOP-4012 – nealmcb Feb 05 '16 at 21:54

You can look at pbzip2 for an example of parallel bz2 compression and decompression.

There is a parallel gzip as well, pigz. It does parallel compression, but not parallel decompression; the deflate format is not suited to parallel decompression. However, you can either a) prepare a special gzip stream with resets of the history, or b) build an index into a gzip file on a first pass. Either way, you can then read different parts in parallel, or have more efficient random access.
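
As a rough illustration of option (a), the sketch below uses plain java.util.zip (not pigz or any Hadoop API; the file names and the 4 MB interval are arbitrary assumptions) to write a gzip member whose deflate stream is given a FULL_FLUSH at intervals. Each full flush ends the output on a byte boundary and erases the compressor's history, so a reader that records those offsets could later start inflating at any of them.

    // Sketch of option (a): a gzip file whose deflate stream is FULL_FLUSHed
    // every few megabytes. Each full flush resets the compression history, so a
    // reader that records these output offsets can restart inflation there.
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.zip.CRC32;
    import java.util.zip.Deflater;

    public class RestartableGzipWriter {
      private static final long FLUSH_INTERVAL = 4L * 1024 * 1024; // arbitrary: reset every ~4 MB of input

      public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("input.txt");         // hypothetical file names
             FileOutputStream out = new FileOutputStream("restartable.gz")) {
          // 10-byte gzip header: magic, deflate method, no flags, zero mtime, unknown OS.
          out.write(new byte[] {0x1f, (byte) 0x8b, 8, 0, 0, 0, 0, 0, 0, (byte) 0xff});

          Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw deflate, no zlib wrapper
          CRC32 crc = new CRC32();
          byte[] inBuf = new byte[64 * 1024];
          byte[] outBuf = new byte[64 * 1024];
          long total = 0, sinceFlush = 0;

          int n;
          while ((n = in.read(inBuf)) > 0) {
            crc.update(inBuf, 0, n);
            total += n;
            sinceFlush += n;
            def.setInput(inBuf, 0, n);
            // FULL_FLUSH empties the compressor, ends on a byte boundary and
            // erases the history; NO_FLUSH just keeps compressing.
            int mode = sinceFlush >= FLUSH_INTERVAL ? Deflater.FULL_FLUSH : Deflater.NO_FLUSH;
            drain(def, out, outBuf, mode);
            if (mode == Deflater.FULL_FLUSH) {
              sinceFlush = 0;  // a real tool would record the current output offset here
            }
          }

          def.finish();
          while (!def.finished()) {
            out.write(outBuf, 0, def.deflate(outBuf));
          }
          def.end();

          // 8-byte gzip trailer: CRC-32 and uncompressed length, little-endian.
          writeIntLE(out, (int) crc.getValue());
          writeIntLE(out, (int) total);
        }
      }

      private static void drain(Deflater def, OutputStream out, byte[] buf, int mode) throws IOException {
        int len;
        while ((len = def.deflate(buf, 0, buf.length, mode)) > 0) {
          out.write(buf, 0, len);
        }
      }

      private static void writeIntLE(OutputStream out, int v) throws IOException {
        out.write(v & 0xff);
        out.write((v >>> 8) & 0xff);
        out.write((v >>> 16) & 0xff);
        out.write((v >>> 24) & 0xff);
      }
    }

For option (b), zlib's zran.c example shows the index-building approach in C: inflate once, record entry points with a window of history at each, then serve random reads from that index.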

  • Thanks Mark for providing the detailed information. Actually, what I am asking is how to work with the bzip2 and gzip formats on Hadoop, for example how to use multiple mappers to read one bzip2 file in parallel. – Lin Ma Dec 26 '12 at 07:43
  • I figured that. You can get a start by seeing how it's done in C. – Mark Adler Dec 26 '12 at 08:16
  • I am not sure whether Hadoop has built-in support for multiple mappers reading one bzip2 or gzip file in parallel? – Lin Ma Dec 26 '12 at 09:24