I have written a simple bash script; the exact code is here:
ideone.com/8XQCjH
#!/bin/bash
# The archive to process is taken as the first argument
# (assumed here; $file was otherwise never set).
file="$1"

# Test the archive: delete it if corrupted, otherwise extract it
# into ./uncompressed and remove the original.
if ! bzip2 -t "$file"
then
    printf '%s is corrupted\n' "$file"
    rm -f "$file"
    #echo "$file" "is corrupted" >> corrupted.log
else
    tar -xjvf "$file" -C ./uncompressed
    rm -f "$file"
fi
Basically, it tests a compressed file; if the file is corrupted it deletes it, otherwise it uncompresses it into another directory and removes the original.
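For reference, assuming the script is saved as, say, checkextract.sh (a hypothetical name), it would be invoked like this:

./checkextract.sh somefile.tar.bz2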
How do I modify this script so that it reads files from an HDFS input directory and writes to another HDFS output directory?
I have seen some examples, such as the one below, though they involve reading the contents of the file. In my case, I am not interested in reading any contents.
http://www.oraclealchemist.com/news/tf-idf-hadoop-streaming-bash-part-1/
If anyone could write a Hadoop command which decompresses files in HDFS, or a similar example, that would greatly help me.
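What I picture is a loop over the archives in the HDFS input directory, along these lines (an untested sketch on my part; the /input and /output paths, the awk/grep path extraction and the temp-dir handling are all assumptions based on my setup):

#!/bin/bash
# Untested sketch: process every .tar.bz2 found in the HDFS input directory.
for f in $(hadoop fs -ls /input | awk '{print $NF}' | grep '\.tar\.bz2$')
do
    tmp=$(mktemp -d)
    # Stream the archive out of HDFS and extract it into a local temp dir;
    # tar's exit status stands in for the separate bzip2 -t test.
    if hadoop fs -cat "$f" | tar -xjvf - -C "$tmp"
    then
        hadoop fs -put "$tmp"/* /output
        hadoop fs -rm "$f"
    else
        printf '%s is corrupted\n' "$f"
        hadoop fs -rm "$f"
    fi
    rm -rf "$tmp"
done

Note that the archive itself is only streamed, never stored as a local file, although the extracted files do stage through a temporary directory before the put.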
Edit:
Try 1:
hadoop fs -get /input/temp.tar.bz2 . && mkdir -p uncompressed && tar -xjvf temp.tar.bz2 -C uncompressed && hadoop fs -put uncompressed/* /output
Not good, as it copies the file onto the native filesystem, uncompresses it there, and then puts the results back into the output directory in HDFS.
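For a plain .bz2 file (with no tar archive inside) I believe the local copy could be avoided entirely, since both ends can be streamed (hypothetical file names):

# Stream-decompress a single .bz2 from HDFS back into HDFS; the bytes
# pass through this machine but are never written to its disk.
hadoop fs -cat /input/data.bz2 | bzip2 -d | hadoop fs -put - /output/data

A tar archive cannot be handled this way, though, since it expands to many files rather than a single stream.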
Try 2:
I wrote a script, uncompress.sh, containing just a single tar command:
uncompress.sh
#!/bin/bash
tar -xjvf -
hadoop jar contrib/streaming/hadoop-streaming.jar \
-numReduceTasks 0 \
-file /home/hadoop/uncompress.sh \
-input /input/temp.tar.bz2 \
-output /output \
-mapper uncompress.sh \
-verbose
However, this gave the error below:
INFO mapreduce.Job: Task Id : attempt_1409019525368_0015_m_000002_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
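My suspicion (unverified) is that streaming feeds the archive's bytes to the mapper as text lines, which mangles the binary data, so tar bails out with exit code 2. One workaround I am considering is to make the job input a plain text file listing HDFS paths, one per line, so each mapper receives path names on stdin instead of raw archive bytes:

#!/bin/bash
# mapper.sh (hypothetical): each stdin line is an HDFS path to a .tar.bz2;
# fetch it, extract into a local temp dir, and upload the results.
while read -r path
do
    tmp=$(mktemp -d)
    if hadoop fs -cat "$path" | tar -xjf - -C "$tmp"
    then
        hadoop fs -put "$tmp"/* /output
    else
        printf '%s is corrupted\n' "$path" >&2
    fi
    rm -rf "$tmp"
done

The streaming job would then take the path list as its -input, and the archives themselves would be fetched by each mapper.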
Thanks