
I have many files in HDFS, each of them a zip file containing a single CSV file. I'm trying to uncompress the files so I can run a streaming job on them.

I tried:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -mapper /bin/zcat -reducer /bin/cat \
    -input /path/to/files/ \
    -output /path/to/output

However, I get an error (subprocess failed with code 1). I also tried running on a single file and got the same error.

Any advice?

Miki Tebeka

4 Answers


The root cause of the problem is that you get many lines of informational text from Hadoop before the actual data arrives.

For example, hdfs dfs -cat hdfs://hdm1.gphd.local:8020/hive/gphd/warehouse/my.db/my/part-m-00000.gz | zcat | wc -l will NOT work either; it fails with the error message "gzip: stdin: not in gzip format".

Therefore you have to skip this unnecessary output. In my case I had to skip 86 lines.

So my one-line command (for counting the records) becomes: hdfs dfs -cat hdfs://hdm1.gphd.local:8020/hive/gphd/warehouse/my.db/my/part-m-00000.gz | tail -n+86 | zcat | wc -l

Note: this is a workaround (not a real solution) and very ugly, because of the hardcoded "86", but it works fine :)
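
If the amount of leading text varies between files, the hardcoded 86 can be avoided. Here is a minimal sketch, assuming GNU grep, bash, and a single gzip stream per file (the path is the same example as above): it finds the byte offset of the gzip magic number and reads from there.

    # Locate the gzip magic bytes (0x1f 0x8b) instead of hardcoding
    # how many lines of Hadoop chatter to skip.
    FILE=hdfs://hdm1.gphd.local:8020/hive/gphd/warehouse/my.db/my/part-m-00000.gz
    OFFSET=$(hdfs dfs -cat "$FILE" | grep -abo $'\x1f\x8b' | head -n1 | cut -d: -f1)
    # tail -c +N is 1-indexed, so start at byte OFFSET+1.
    hdfs dfs -cat "$FILE" | tail -c +$((OFFSET + 1)) | zcat | wc -l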

Miklós Molnár

A simple way to unzip/uncompress a file within HDFS, for whatever reason:

hadoop fs -text /hdfs-path-to-zipped-file.gz | hadoop fs -put - /hdfs-path-to-unzipped-file.txt
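
If there are many files to convert, the same one-liner can be wrapped in a loop. A minimal sketch, assuming a hypothetical /path/to/files directory of .gz files (hadoop fs -text understands gzip, but not zip archives):

    # Stream each file's decompressed text back into HDFS under the
    # same name with a .txt extension instead of .gz.
    for f in $(hadoop fs -ls /path/to/files | awk '{print $NF}' | grep '\.gz$'); do
        hadoop fs -text "$f" | hadoop fs -put - "${f%.gz}.txt"
    done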
Jay

After experimenting, I discovered that with this modification to the Hadoop streaming command, you get all your gzipped files uncompressed into a new directory. The original file names are lost (the files are renamed to the typical part-XXXX names), but this worked for me.

I speculate this works because Hadoop automatically uncompresses gzipped files under the hood, and cat just echoes that uncompressed output.

hadoop jar /usr/iop/4.2.0.0/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -mapper /bin/cat \
    -input  /path-to-gzip-files-directory \
    -output /your-gunzipped-directory
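
On newer Hadoop releases the property name has changed. A hedged variant of the same command, assuming a standard Apache layout for the streaming jar (your distribution's path may differ):

    # mapreduce.job.reduces replaces the deprecated mapred.reduce.tasks;
    # /bin/cat still acts as an identity mapper.
    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -D mapreduce.job.reduces=0 \
        -mapper /bin/cat \
        -input  /path-to-gzip-files-directory \
        -output /your-gunzipped-directory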
dman

Hadoop can read files compressed in the gzip format, but that's different from the zip format. Hadoop cannot read zip files AFAIK.
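
Since there is no built-in zip codec, one workaround is to round-trip each archive through the local filesystem. A minimal sketch (all paths are hypothetical), assuming each zip holds a single CSV as in the question:

    # Copy the archive out of HDFS, extract its contents to stdout
    # with unzip -p, and put the plain CSV back into HDFS.
    hadoop fs -get /path/to/files/archive.zip /tmp/archive.zip
    unzip -p /tmp/archive.zip > /tmp/data.csv
    hadoop fs -put /tmp/data.csv /path/to/output/data.csv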

user394827