Passing directories to hadoop streaming : some help needed

Question

I am trying to run a streaming job on Amazon EMR (through the web UI), with a bash script as the mapper. The job arguments are:

-input s3://emrdata/test_data/input -output s3://emrdata/test_data/output -mapper s3://emrdata/test_data/scripts/mapperScript.sh -reducer NONE

The input directory has sub-directories in it and these sub-directories have gzipped data files.

The relevant part of mapperScript.sh that fails is:

# Loop over everything under the input directory (including sub-directories)
for filename in "$input"/*; do

    dir_name=$(dirname "$filename")
    fname=$(basename "$filename")
    echo "$fname" >&2

    modelname=${fname}.model
    modelfile=$model_location/$modelname
    echo "$modelfile" >&2

    inputfile=$dir_name/$fname
    echo "$inputfile" >&2

    outputfile=$output/$fname
    echo "$outputfile" >&2

    # Will do some processing on the files in the sub-directories here

done

Basically, I need to read the sub-directories in streaming mode. When I run this, Hadoop fails to launch the job and complains:

2013-03-01 10:41:26,226 ERROR org.apache.hadoop.security.UserGroupInformation (main): PriviledgedActionException as:hadoop cause:java.io.IOException: Not a file: s3://emrdata/test_data/input/data1
2013-03-01 10:41:26,226 ERROR org.apache.hadoop.streaming.StreamJob (main): Error Launching job : Not a file: s3://emrdata/test_data/input/data1

I am aware that a similar question has been asked here.

The suggestion there was to write one's own InputFormat. I am wondering if I am missing something else in the way my script is written / EMR inputs are given, or whether writing my own InputFormat in Java is my only choice.

I have also tried giving the input to EMR as "input/*", but with no luck.

Did you already have a look at http://wiki.apache.org/hadoop/AmazonS3 and checked your settings of fs.default.name etc.? — Michael Hausenblas, Mar 01 '13 at 13:33
Michael, yes I did do that. Those settings seem fine, and moreover, work perfectly when there are no sub-directories being passed to the streaming script. — Girish Nathan, Mar 02 '13 at 02:39
Amar - nope, have not tried that. I will do that and check if it helps, thanks! — Girish Nathan, Mar 02 '13 at 02:39

Amar · Answer 1 · 2013-03-01T18:55:26.697

It seems that there may be some temporary workarounds, but Hadoop does not inherently support this yet; there is an open ticket on this here. So while inputpath/*/* may work for two levels of subdirectories, it may fail for deeper nesting.

The best thing you can do for now is to walk the input tree recursively, collect the files (or the folders that contain no subdirectories), and pass them to the job as a comma-separated list of input paths. You may use simple tools like s3cmd for this.
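As a rough illustration of the listing approach, here is a minimal bash sketch. The function name leaf_dirs_csv is hypothetical. It assumes you already have one S3 object URL per line, for example taken from the last column of a recursive s3cmd listing (check the output format of your s3cmd version). It also assumes the job accepts a comma-separated value for -input, which is worth verifying on your EMR/Hadoop version.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hadoop or s3cmd): reads one S3 object URL
# per line on stdin and prints the unique parent directories of those objects
# as a single comma-separated line.
leaf_dirs_csv() {
  grep -v '/$' |          # skip directory-marker entries that end in "/"
    sed 's|/[^/]*$||' |   # strip the file name, keeping the parent directory
    sort -u |             # de-duplicate the directories
    paste -sd, -          # join all lines with commas
}

# Possible usage (assumes the URL is the last column of the s3cmd listing):
#   inputs=$(s3cmd ls --recursive s3://emrdata/test_data/input/ \
#              | awk '{print $NF}' | leaf_dirs_csv)
#   ... then pass "$inputs" as the value of -input
```

Because only directories that directly contain files end up in the list, each entry holds plain files at its top level, which is what the "Not a file" check trips over. A directory that mixes files and subdirectories would still need separate handling, for instance by listing its files individually.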
