The context is that I am trying to run a streaming job on Amazon EMR (through the web UI) with a bash mapper script, invoked with arguments like:
-input s3://emrdata/test_data/input -output s3://emrdata/test_data/output -mapper
s3://emrdata/test_data/scripts/mapperScript.sh -reducer NONE
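For reference, these arguments belong to a hadoop-streaming invocation; a minimal sketch of the equivalent full command (the streaming jar path here is an assumption, it varies by Hadoop/EMR version) would be:

hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
    -input s3://emrdata/test_data/input \
    -output s3://emrdata/test_data/output \
    -mapper s3://emrdata/test_data/scripts/mapperScript.sh \
    -reducer NONE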
The input directory has sub-directories in it, and these sub-directories contain gzipped data files.
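To illustrate the layout (data1 appears in the error below; the file names here are made up):

s3://emrdata/test_data/input/data1/part-00000.gz
s3://emrdata/test_data/input/data1/part-00001.gz
s3://emrdata/test_data/input/data2/part-00000.gz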
The relevant part of mapperScript.sh that fails is:
for filename in "$input"/*; do
    dir_name=$(dirname "$filename")
    fname=$(basename "$filename")
    echo "$fname" >&2
    modelname=${fname}.model
    modelfile=$model_location/$modelname
    echo "$modelfile" >&2
    inputfile=$dir_name/$fname
    echo "$inputfile" >&2
    outputfile=$output/$fname
    echo "$outputfile" >&2
    # Will do some processing on the files in the sub-directories here
done # this is the loop for getting input from all sub-directories
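The snippet assumes $input, $output, and $model_location are set earlier in the script; the placeholders below are purely hypothetical, just to make the loop self-contained:

input=./input                                 # hypothetical; the real script sets this elsewhere
output=s3://emrdata/test_data/output          # matches the -output argument above
model_location=s3://emrdata/test_data/models  # hypothetical model directory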
Basically, I need to read the sub-directories in streaming mode, and when I run this, Hadoop complains:
2013-03-01 10:41:26,226 ERROR org.apache.hadoop.security.UserGroupInformation (main): PriviledgedActionException as:hadoop cause:java.io.IOException: Not a file: s3://emrdata/test_data/input/data1
2013-03-01 10:41:26,226 ERROR org.apache.hadoop.streaming.StreamJob (main): Error Launching job : Not a file: s3://emrdata/test_data/input/data1
I am aware that a similar question has been asked here. The suggestion there was to write one's own InputFormat. I am wondering whether I am missing something else in the way my script is written / the EMR inputs are given, or whether writing my own InputFormat in Java is my only choice.
I have tried giving my input to EMR with an "input/*" glob as well, but no luck.