How to convert gzip to bzip2 via HDFS / Hadoop

Question

I have a ton of data files coming in from a client, all gzipped. I want them as bzip2 (.bz2) instead, since bzip2 is splittable and better suited to the intensive analysis I have ahead.

Full disclosure: I use Hive and have yet to write more than very basic Hadoop jobs.

My simple attempt with a piped command appears to work, but it uses only a single CPU on the master node, so it will complete in 2017 for the 12 TB of transforms ahead...

hadoop fs -cat /rawdata/mcube/MarketingCube.csv.gz | gzip -dc | bzip2 > cube.bz2

Appreciate any tips on how to make this a MapReduce job so that I can do this (once) for all the files that I'll be hitting repeatedly this weekend. Thanks.
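For illustration, one way the pipe above could be spread over the cluster is a map-only Hadoop Streaming job in which each mapper receives HDFS paths rather than file contents and runs the same pipe once per file. This is a minimal sketch, not a tested job: the script name `convert_mapper.sh`, the `OUT_DIR` variable, the `/bz2data` default, and the `run` argument are all assumptions made for this example.

```shell
#!/bin/sh
# convert_mapper.sh (hypothetical name): mapper for a map-only Hadoop Streaming job.
# Input on stdin: one HDFS path to a .gz file per line.
# OUT_DIR and the "run" argument are conventions of this sketch, not Hadoop features.
set -e

# /rawdata/mcube/MarketingCube.csv.gz + /bz2data -> /bz2data/MarketingCube.csv.bz2
target_path() {
  printf '%s/%s.bz2\n' "$2" "$(basename "$1" .gz)"
}

# Streaming may hand the mapper "key<TAB>path"; keep only the last tab-separated field.
path_from_line() {
  printf '%s\n' "$1" | awk -F'\t' '{ print $NF }'
}

# Same pipe as in the question, but the result is written back to HDFS ("-put -" reads stdin).
# Without pipefail only the exit status of the final "hadoop fs -put" is checked.
convert_one() {
  hadoop fs -cat "$1" < /dev/null | gzip -dc | bzip2 | hadoop fs -put - "$2"
}

main() {
  out_dir="${OUT_DIR:-/bz2data}"
  while IFS= read -r line; do
    src="$(path_from_line "$line")"
    [ -n "$src" ] || continue
    convert_one "$src" "$(target_path "$src" "$out_dir")"
    printf 'converted\t%s\n' "$src"   # one log line per file in the job output
  done
}

# Only start reading stdin when invoked as "convert_mapper.sh run".
if [ "${1:-}" = "run" ]; then main; fi
```

Such a job would be submitted with zero reducers (`-D mapred.reduce.tasks=0`), `-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat`, and a text file listing the .gz paths as `-input`; the location of the streaming jar differs per distribution. Parallelism here is per file only: each individual .gz is still decompressed on one core, and very long conversions may need a larger task timeout.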

[Splittable Gzip](https://issues.apache.org/jira/browse/HADOOP-7076) adds some pseudo-split to gzip, see [Making gzip splittable for Hadoop](http://niels.basjes.nl/splittable-gzip) — Remus Rusanu, Apr 05 '14 at 18:14
Remus, thanks, but he says on github "Hadoop 1.x is not yet supported" so.... Need another option. https://github.com/nielsbasjes/splittablegzip — Todd Curry, Apr 05 '14 at 18:29
8 years later, have you found a good solution other than asking the client to change their format? — user1485864, Mar 08 '22 at 10:40
I honestly cannot recall how I solved this 8 years ago. 12TB of data was a lot in 2014, the need was immediate, and there was no way to get around the single-threaded un-gzipping. Today, however, I would look at pugz -- give that a look and see if it helps you. https://github.com/Piezoid/pugz — Todd Curry, Mar 08 '22 at 20:35

score 0 · Answer 1 · answered Apr 05 '14 at 21:01

What you can do is use the PailFile format from https://github.com/nathanmarz/dfs-datastores to store your gzipped files in smaller chunks that fit your HDFS block size.

This way your next jobs (Hive or otherwise) can be parallelized across the various splits even if the files are gzipped.
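To make the chunking idea concrete: a gzip file cannot be split, but many small gzip files can each go to their own mapper. The sketch below is not the dfs-datastores API. It only illustrates re-chunking in plain shell, assuming GNU `split` (for `--filter`) and a hypothetical helper name `rechunk_gz`.

```shell
#!/bin/sh
# rechunk_gz SRC PREFIX LINES (hypothetical helper; needs GNU split for --filter).
# Decompresses SRC once and writes independent gzip files of at most LINES lines each:
# PREFIXaa.gz, PREFIXab.gz, ... Splitting on line boundaries keeps records whole.
rechunk_gz() {
  gzip -dc "$1" | split -l "$3" --filter='gzip > "$FILE.gz"' - "$2"
}
```

For example, `rechunk_gz MarketingCube.csv.gz part_ 1000000` would produce `part_aa.gz`, `part_ab.gz`, and so on, which could then be uploaded with `hadoop fs -put`. The line count would need tuning so that each compressed chunk lands near the HDFS block size, and the initial decompression of each source file is still single-threaded.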

WiseTechi

http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2 -- see slide 7. I don't understand how PailFile makes something splittable that is inherently unusable in split form. – Todd Curry Apr 06 '14 at 01:53
