
I am trying to run a streaming job on Amazon EMR (through the web UI), with a bash script as the mapper, using these arguments:

-input s3://emrdata/test_data/input -output s3://emrdata/test_data/output -mapper s3://emrdata/test_data/scripts/mapperScript.sh -reducer NONE

The input directory has sub-directories in it and these sub-directories have gzipped data files.

The relevant part of mapperScript.sh that fails is:

for filename in "$input"/*; do

dir_name=$(dirname "$filename")
fname=$(basename "$filename")

echo "$fname">/dev/stderr

modelname=${fname}.model

modelfile=$model_location/$modelname

echo "$modelfile">/dev/stderr

inputfile=$dir_name/$fname

echo "$inputfile">/dev/stderr

outputfile=$output/$fname

echo "$outputfile">/dev/stderr

# Will do some processing on the files in the sub-directories here

done # this is the loop for getting input from all sub-directories

Basically, I need to read the sub-directories in streaming mode. When I run this, Hadoop complains:

2013-03-01 10:41:26,226 ERROR org.apache.hadoop.security.UserGroupInformation (main): PriviledgedActionException as:hadoop cause:java.io.IOException: Not a file: s3://emrdata/test_data/input/data1
2013-03-01 10:41:26,226 ERROR org.apache.hadoop.streaming.StreamJob (main): Error Launching job : Not a file: s3://emrdata/test_data/input/data1

I am aware that a similar question has been asked here.

The suggestion there was to write one's own InputFormat. I am wondering if I am missing something else in the way my script is written / EMR inputs are given, or whether writing my own InputFormat in Java is my only choice.

I have also tried passing the input to EMR as "input/*", but with no luck.

John Rotenstein

1 Answer


There may be some temporary workarounds, but Hadoop does not inherently support this yet; there is an open ticket on it here. So while inputpath/*/* may work for two levels of subdirectories, it may fail for deeper nesting.

The best thing you can do for now is to get a listing of the folders that contain only files (no subdirectories), build a comma-separated list of those paths, and pass that list as the input. You can use simple tools like s3cmd for this.
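As a rough sketch of that idea (not a tested EMR recipe), the helper below turns a list of object URLs into a comma-separated list of the directories that hold them. The function name leaf_dirs_csv and the part-*.gz file names are made up for illustration; only the bucket layout comes from the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads object URLs (one per line) on stdin and prints
# a comma-separated list of the unique directories that contain them.
leaf_dirs_csv() {
    # Strip the last path component, de-duplicate, then join with commas.
    sed 's|/[^/]*$||' | sort -u | paste -sd, -
}

# Demo with a hard-coded listing (file names are invented for the example).
printf '%s\n' \
    's3://emrdata/test_data/input/data1/part-0.gz' \
    's3://emrdata/test_data/input/data1/part-1.gz' \
    's3://emrdata/test_data/input/data2/part-0.gz' | leaf_dirs_csv
```

For the demo listing this prints s3://emrdata/test_data/input/data1,s3://emrdata/test_data/input/data2, which could then be given as the -input value. With s3cmd, the listing might come from something like `s3cmd ls --recursive s3://emrdata/test_data/input/ | awk '{print $4}' | leaf_dirs_csv`, assuming the URL is the fourth column of the output and that keys contain no spaces.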

Amar