
How can I compress HDFS data to bzip2 using Pig so that, on decompression, it gives back the same directory structure it had initially? I am new to Pig.

I tried compressing with bzip2, but it generated many files because many mappers were spawned, and so reverting back to the plain text files (the initial form) in the same directory structure becomes difficult.

It should work just like in Unix, where compressing a folder into a bzip2 tarball and then decompressing the .tar.bz2 gives back exactly the same data and folder structure it had initially.

e.g. compression:

tar -cjf compress_folder.tar.bz2 compress_folder/

and decompression:

tar -xjf compress_folder.tar.bz2

restores exactly the same directory structure.


1 Answer


Approach 1:

You can try running a single reducer so that only one file is stored on HDFS, but the compromise here will be performance:

set default_parallel 1;
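Note that default_parallel only controls reduce-side parallelism; a plain LOAD/STORE script is map-only, so it will still write one part file per mapper. Below is a minimal sketch of forcing the data through a single reducer (the paths are hypothetical; GROUP ALL runs with one reducer, and the FOREACH ... FLATTEN restores one record per input line):

set default_parallel 1;

raw = LOAD '/user/hduser/data/input' USING TextLoader() AS (line:chararray);

-- GROUP ALL forces a reduce phase, funnelling all records to one reducer
grouped = GROUP raw ALL;
flat = FOREACH grouped GENERATE FLATTEN(raw);

STORE flat INTO '/user/hduser/data/single' USING PigStorage();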

To compress the data, set these parameters in the Pig script, if you have not already tried this way:

SET output.compression.enabled true;
SET mapred.output.compression.codec 'org.apache.hadoop.io.compress.BZip2Codec';

Then just use JsonStorage while storing the file:

STORE file INTO '/user/hduser/data/usercount' USING JsonStorage();
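As a side note, PigStorage and TextLoader also support bzip2 natively: if the STORE path ends in .bz2, Pig writes the output bzip2-compressed without the codec settings above (the path here is hypothetical):

STORE file INTO '/user/hduser/data/usercount.bz2' USING PigStorage();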

When you eventually want to read the data back, use TextLoader:

data = LOAD '/user/hduser/data/usercount/' USING TextLoader();
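TextLoader transparently decompresses bzip2 input as well, so a compressed directory like the one in the sketch above can be read back the same way and verified with a DUMP (path hypothetical):

data = LOAD '/user/hduser/data/usercount.bz2' USING TextLoader() AS (line:chararray);
DUMP data;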

Approach 2:

filecrush: a file merge utility, available on GitHub.
