
Please allow me to provide a scenario:

hadoop jar test.jar Test inputFileFolder outputFileFolder

where

  • test.jar sorts info by key, time, and place
  • inputFileFolder contains multiple .gz files, each about 10 GB
  • outputFileFolder will contain a bunch of .gz files

My question is: what is the best way to deal with the .gz files in inputFileFolder? Thank you!

frankilee

1 Answer


Hadoop will automatically detect and read .gz files: the input format picks the decompression codec from the file extension. However, since gzip is not a splittable compression format, each file will be read by a single mapper, so a 10 GB file cannot be processed in parallel. Your best bet is to switch to a splittable setup, for example Snappy block-compressed inside a container format such as SequenceFile, or to decompress, split, and re-compress into smaller, block-sized files.
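As a rough illustration of the first option, here is a minimal sketch of an identity MapReduce job that reads the gzipped input (decompressed transparently, one mapper per file) and rewrites it as block-compressed Snappy SequenceFiles, which Hadoop can then split for the actual sort job. The class name Recompress and the key/value types are assumptions for plain text lines; adapt them to your data.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class Recompress {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "gz-to-snappy-seqfile");
        job.setJarByClass(Recompress.class);

        // Map-only identity job: records pass straight through unchanged.
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Gzip input is decompressed transparently by the default
        // TextInputFormat, but each .gz file gets exactly one mapper.
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Write block-compressed Snappy SequenceFiles, which remain
        // splittable (requires native Snappy support on the cluster).
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(
                job, SequenceFile.CompressionType.BLOCK);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would run this once as a preprocessing step, e.g. hadoop jar recompress.jar Recompress inputFileFolder recompressedFolder, and then point your sort job at recompressedFolder so each block can be processed by its own mapper.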

Ben Watson