
I have written a simple bash script. The exact code is here: ideone.com/8XQCjH

#!/bin/bash
# $file is assumed to hold the path of the .tar.bz2 archive;
# here it is taken from the first command-line argument
file="$1"
if ! bzip2 -t "$file"
then
    printf '%s is corrupted\n' "$file"
    rm -f "$file"
    #echo "$file" "is corrupted" >> corrupted.log
else
    tar -xjvf "$file" -C ./uncompressed
    rm -f "$file"
fi

Basically, it reads a compressed file, tests its integrity and, if it is valid, uncompresses it into another directory.
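The script takes the archive path as its first argument and is invoked as, for example (extract.sh is just a placeholder name for the script above):

# the target directory the script extracts into must exist
mkdir -p ./uncompressed
# test and, if valid, extract a single archive
bash extract.sh /path/to/archive.tar.bz2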

How do I modify this code so that it can read files from an HDFS input directory instead and write the output to another HDFS output directory?

I have seen some examples, like the one below, though they involve reading the contents of the files. In my case, I am not interested in reading any contents.
http://www.oraclealchemist.com/news/tf-idf-hadoop-streaming-bash-part-1/

If anyone could write a Hadoop command which unzips files in HDFS, or point to a similar example, that would greatly help me.

Edit:
Try 1:
hadoop fs -get /input/temp.tar.bz2 | tar -xjv | hadoop fs -put - /output

This is not good, as it copies the file to the local filesystem, uncompresses it there, and then puts the result back into the output directory in HDFS.
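Spelled out as separate steps, the same round trip looks roughly like this (using the paths from the attempt above); the intermediate copy on the local filesystem is exactly what I would like to avoid:

# copy the archive from HDFS to the local filesystem
hadoop fs -get /input/temp.tar.bz2 .
# extract it locally
mkdir -p ./uncompressed
tar -xjvf temp.tar.bz2 -C ./uncompressed
# copy the extracted files back into HDFS and clean up
hadoop fs -put ./uncompressed /output
rm -f temp.tar.bz2
rm -rf ./uncompressed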

Try 2:
I wrote a script, uncompress.sh, with just one line of code:

uncompress.sh
tar -xjv


hadoop jar contrib/streaming/hadoop-streaming.jar \
-numReduceTasks 0 \
-file /home/hadoop/uncompress.sh \
-input /input/temp.tar.bz2 \
-output /output \
-mapper uncompress.sh \
-verbose

However, this gave the error below.

INFO mapreduce.Job: Task Id : attempt_1409019525368_0015_m_000002_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2

Thanks

prog_guy
  • Hi everyone, I have edited the question as per the request. Please reconsider and put it back online. Thanks. – prog_guy Aug 26 '14 at 11:02
  • The question is unclear. You should point out that you have a big `.tar.bz2` file and that you want to parallelize the decompression by splitting it automatically into subjobs. – pqnet Aug 27 '14 at 09:53
  • Hi, I don't have a big tar.bz2, just thousands upon thousands of medium-sized (300 MB) files. – prog_guy Sep 03 '14 at 07:23

1 Answer

From the man page of bzip2:

-t --test

Check integrity of the specified file(s), but don't decompress them. This really performs a trial decompression and throws away the result.

This means that there is no way to check the file without reading it. Also, since you are going to decompress the archive anyway if it turns out to be valid, you should probably just decompress it directly.

That said, you can use

hadoop fs -cat hdfs://my_file_name | bzip2 -ct

to test the file and

# create a temporary directory for the extracted files
tmpdir=$(mktemp -d)
# stream the archive out of HDFS and extract it into the temporary directory
hadoop fs -cat hdfs://my_file_name | tar -xjvf - -C "$tmpdir"
# copy the extracted tree back into HDFS
hadoop fs -copyFromLocal "$tmpdir"/ hdfs://dest_dir

to decompress it. There is no way to have tar write the files directly into HDFS: Hadoop streaming is designed around the pattern "download what you need, do the work in a temporary directory, upload the results back".
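Putting the test and the extraction together, a minimal sketch that mirrors the original script for a single archive could look like the following (the paths /input/temp.tar.bz2 and /output are taken from the question; adjust them to your setup):

#!/bin/bash
# HDFS paths taken from the question; adjust as needed
src=/input/temp.tar.bz2
dest=/output
if ! hadoop fs -cat "$src" | bzip2 -t
then
    printf '%s is corrupted\n' "$src"
    hadoop fs -rm "$src"
else
    tmpdir=$(mktemp -d)
    # stream the archive out of HDFS, extract it locally, copy the result back
    hadoop fs -cat "$src" | tar -xjvf - -C "$tmpdir"
    hadoop fs -copyFromLocal "$tmpdir"/ "$dest"
    hadoop fs -rm "$src"
    rm -rf "$tmpdir"
fi

Note that this streams the archive out of HDFS twice, once for the test and once for the actual extraction; since tar fails on a corrupted bz2 stream anyway, the separate bzip2 -t pass can simply be dropped if extraction itself is an acceptable test.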

That said, are you using Hadoop to decompress a large number of files, or do you want to parallelize the decompression of one single giant file? In the second case you have to write an ad-hoc program to split the input into multiple parts and decompress them separately; Hadoop will not automatically parallelize tasks for you. In the first case, you can use a script like this as the mapper:

#!/bin/bash
# each input line is expected to hold the HDFS path of one .tar.bz2 archive
while IFS= read -r filename ; do
    tmpdir=$(mktemp -d)
    # stream the archive out of HDFS, extract it locally, copy the result back
    hadoop fs -cat "hdfs:/$filename" | tar -xjvf - -C "$tmpdir"
    hadoop fs -copyFromLocal "$tmpdir"/ "hdfs:/$filename".dir/
    rm -rf "$tmpdir"
done

and, as the job input, you use a file containing the list of the tar.bz2 files to decompress:

...
/path/my_file.tar.bz2
/path2/other_file.tar.bz2
....
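A streaming invocation along the lines of the one in the question can then drive the mapper; here the mapper above is assumed to be saved as /home/hadoop/uncompress_list.sh and the file list uploaded to HDFS as /input/filelist.txt (both names are hypothetical):

hadoop jar contrib/streaming/hadoop-streaming.jar \
-numReduceTasks 0 \
-file /home/hadoop/uncompress_list.sh \
-input /input/filelist.txt \
-output /output \
-mapper uncompress_list.sh \
-verbose

Splitting the list into several input files (one per desired mapper) is a simple way to have several archives processed in parallel.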
pqnet
  • "That said, you are using hadoop to perform decompression of a large number of files": yes, that is what I intend to do. However, your while loop looks like a sequential operation; only one file will be decompressed at any moment, regardless of the number of nodes I spin up in the cluster. – prog_guy Aug 27 '14 at 02:06
  • @prog_guy You are being a smartass, but you didn't read the whole answer. I clearly pointed out that Hadoop will not make your process automatically parallel if you can't split it into many small jobs, and as far as I know there is no straightforward way to split a `.tar.bz2` into many small files and decompress them separately. See `http://stackoverflow.com/questions/10282296/split-tar-bz2-file-and-extract-each-individually`. – pqnet Aug 27 '14 at 09:48
  • @prog_guy If you are experimenting with Hadoop trying to learn it you picked the wrong problem, try solving another one. If you really want to do cluster decompression you should drop Hadoop and look at http://compression.ca/mpibzip2/ – pqnet Aug 27 '14 at 09:50
  • @prog_guy the while loop is executed by each mapper. If you have more 300 MB files than CPUs, you should probably process more than one file per CPU. Would you please try this solution? – pqnet Sep 03 '14 at 13:03