
Please allow me to provide a scenario:

hadoop jar test.jar Test inputFileFolder outputFileFolder

where

  • test.jar sorts info by key, time, and place
  • inputFileFolder contains multiple .gz files, each about 10 GB
  • outputFileFolder will contain a bunch of .gz files

My question is: what is the best way to deal with the .gz files in inputFileFolder? Thank you!

frankilee

1 Answer


Hadoop will automatically detect and read .gz files: the input format picks the decompression codec from the file extension. However, since gzip is not a splittable compression format, each file will be read by a single mapper, so a 10 GB file cannot be processed in parallel. Your best bet is to switch to a splittable setup, for example Snappy block-compressed inside a container format such as SequenceFile, or to decompress, split, and re-compress into smaller, block-sized files.
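As a rough illustration of the first option, here is a minimal sketch of an identity MapReduce job that reads the gzipped input (decompressed transparently, one mapper per file) and rewrites it as block-compressed Snappy SequenceFiles, which Hadoop can then split for the actual sort job. The class name Recompress and the key/value types are assumptions for plain text lines; adapt them to your data.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class Recompress {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "gz-to-snappy-seqfile");
        job.setJarByClass(Recompress.class);

        // Map-only identity job: records pass straight through unchanged.
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Gzip input is decompressed transparently by the default
        // TextInputFormat, but each .gz file gets exactly one mapper.
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Write block-compressed Snappy SequenceFiles, which remain
        // splittable (requires native Snappy support on the cluster).
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(
                job, SequenceFile.CompressionType.BLOCK);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would run this once as a preprocessing step, e.g. hadoop jar recompress.jar Recompress inputFileFolder recompressedFolder, and then point your sort job at recompressedFolder so each block can be processed by its own mapper.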

Ben Watson