I have a ton of data files coming in from a client, all gzipped. I want them in bzip2 (.bz2), since that's splittable and better suited to the heavy analysis I have ahead.
Full disclosure: I use Hive and have yet to do more than very basic Hadoop jobs.
My simple attempt with a piped command appears to work, but it runs on a single CPU on the master node, which at this rate will finish sometime in 2017 for the 12TB I have to transform...
hadoop fs -cat /rawdata/mcube/MarketingCube.csv.gz | gzip -dc | bzip2 > cube.bz2
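(Side note: I'm assuming I'd want the result back in HDFS rather than on the master's local disk, so something like the line below, with a destination path I've made up, but that doesn't fix the single-CPU bottleneck.)

hadoop fs -cat /rawdata/mcube/MarketingCube.csv.gz | gzip -dc | bzip2 | hadoop fs -put - /rawdata/mcube/MarketingCube.csv.bz2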
I'd appreciate any tips on how to turn this into a MapReduce job so that I can convert all the files once, since I'll be hitting them repeatedly this weekend.
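For what it's worth, my best guess is a map-only Hadoop Streaming job with cat as the mapper and bzip2 compression turned on for the output; the streaming jar path, property names, and output directory below are my assumptions, so corrections welcome:

# map-only: the gzip input is decompressed automatically on read; output is recompressed with BZip2Codec
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapreduce.job.reduces=0 \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
    -input /rawdata/mcube/MarketingCube.csv.gz \
    -output /rawdata/mcube/bzip2 \
    -mapper /bin/cat

My understanding is that each .gz input still lands on a single mapper (gzip isn't splittable), so the parallelism would come from feeding the job many files at once rather than from splitting any one of them. Thanks.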