Is compression/decompression of gzip data transparent in Hadoop/PIG?

Question

I read somewhere that Hadoop has built-in support for compression and decompression, but I guess that is about mapper output (enabled by setting some properties)?

Are there any particular Pig load/store functions I can use to read compressed data or to write compressed output?

score 6 · Accepted Answer · answered Mar 27 '12 at 20:36

PigStorage handles compressed input by examining the file names:

*.bz2 / *.bz - org.apache.pig.bzip2r.Bzip2TextInputFormat
Everything else - org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat. This extends org.apache.hadoop.mapreduce.lib.input.TextInputFormat, which can handle .gz and zippy files if you have the codecs installed.
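
As a minimal sketch of what that means in practice (the paths and the schema below are made up for illustration), a plain LOAD with PigStorage should need no extra options for gzip or bzip2 input:

```pig
-- Hypothetical paths and schema, for illustration only.
-- PigStorage picks the input format from the file extension.
gz_logs = LOAD '/data/logs/access.log.gz' USING PigStorage('\t')
          AS (ip:chararray, url:chararray);
bz_logs = LOAD '/data/logs/access.log.bz2' USING PigStorage('\t')
          AS (ip:chararray, url:chararray);
```

Keep in mind that gzip files are not splittable, so each .gz file is read by a single mapper, whereas bzip2 input can be split.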

Output is handled via some properties:

output.compression.enabled - true / false
output.compression.codec - the class name of the codec to use (org.apache.hadoop.io.compress.GzipCodec for gzip)
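
A sketch of how these two properties might be set from within a Pig script, assuming they are picked up through Pig's SET command (the paths are hypothetical):

```pig
-- Hypothetical paths; the two property names are the ones listed above.
SET output.compression.enabled true;
SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

logs = LOAD '/data/logs/access.log' USING PigStorage('\t');

-- With the properties above, the part files in the output directory
-- should be written gzip-compressed.
STORE logs INTO '/data/out/access_compressed' USING PigStorage('\t');
```

As far as I know, PigStorage also compresses output automatically when the store path itself ends in .gz or .bz2, which can be simpler if only one STORE in the script needs compression.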

If you're feeling up to it, digging through PigStorage.java may be of interest to you.

http://my.safaribooksonline.com/book/-/9781449317881/8dot-making-pig-fly/id2907215 also gives some more details about intermediate compression — Chris White, Mar 27 '12 at 20:37
