
I have a ton of data files coming in from a client, all gzipped. I want them in bzip2 format, since that is splittable and preferable for the intense analysis I have ahead.

Full disclosure: I use Hive and have yet to do more than very basic Hadoop jobs.

My simple attempt at a piped command appears to work, but it uses only a single CPU on the master node, so the 12TB of transforms ahead will complete in 2017...

hadoop fs -cat /rawdata/mcube/MarketingCube.csv.gz | gzip -dc | bzip2 > cube.bz2 

Appreciate any tips on how to make this a MapReduce job so that I can do this (once) for all the files that I'll be hitting repeatedly this weekend. Thanks.

Todd Curry
  • [Splittable Gzip](https://issues.apache.org/jira/browse/HADOOP-7076) adds some pseudo-split to gzip, see [Making gzip splittable for Hadoop](http://niels.basjes.nl/splittable-gzip) – Remus Rusanu Apr 05 '14 at 18:14
  • Remus, thanks, but he says on github "Hadoop 1.x is not yet supported" so.... Need another option. https://github.com/nielsbasjes/splittablegzip – Todd Curry Apr 05 '14 at 18:29
  • 8 years later, have you found a good solution other than asking the client to change their format? – user1485864 Mar 08 '22 at 10:40
  • I honestly cannot recall how I solved this 8 years ago. 12TB of data was a lot in 2014, the need was immediate, and there was no way to get around the single-threaded un-gzipping. Today, however, I would look at pugz -- give that a look and see if it helps you. https://github.com/Piezoid/pugz – Todd Curry Mar 08 '22 at 20:35

1 Answer


What you can do is use the PailFile format from https://github.com/nathanmarz/dfs-datastores to store your gzipped data in smaller chunks that fit your HDFS block size.

This way your next jobs (whether Hive or something else) can be parallelized across the various chunks even though the files are gzipped.
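
A minimal sketch of the chunking idea, not the PailFile API: the Python below is a stand-in that assumes line-oriented data such as CSV. It reads one gzip stream sequentially and rewrites it as several smaller files, each a complete gzip stream of its own, so later jobs can work on the chunks in parallel even though no single gzip file can be split. The function name `rechunk_gzip`, the `part-NNNNN.gz` naming, and the `target_bytes` parameter are hypothetical. `target_bytes` counts uncompressed bytes as a rough proxy, on the assumption that each compressed chunk comes out no larger than that.

```python
import gzip
import os


def rechunk_gzip(src_path, out_dir, target_bytes):
    """Rewrite one gzip file as several independent gzip files.

    Chunks are cut on line boundaries once roughly `target_bytes` of
    uncompressed data have been written (hypothetical knob; set it at or
    below the HDFS block size). Returns the chunk paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    written = 0
    with gzip.open(src_path, "rb") as src:
        for line in src:
            # Start a new chunk for the first line or once the current one is full.
            if out is None or written >= target_bytes:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, "part-%05d.gz" % len(paths))
                paths.append(path)
                out = gzip.open(path, "wb")
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return paths
```

The first pass over each input file is still sequential, because a gzip stream has to be decompressed from its start, but separate input files can be re-chunked concurrently and every later job sees many small, independently readable files.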

WiseTechi
  • http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2 -- see slide 7. I don't understand how PailFile makes something splittable that is inherently unusable in split form. – Todd Curry Apr 06 '14 at 01:53