
I am trying to run a hadoop-streaming python job.

bin/hadoop jar contrib/streaming/hadoop-0.20.1-streaming.jar \
    -D stream.non.zero.exit.is.failure=true \
    -input /ixml \
    -output /oxml \
    -mapper scripts/mapper.py \
    -file scripts/mapper.py \
    -inputreader "StreamXmlRecordReader,begin=channel,end=/channel" \
    -jobconf mapred.reduce.tasks=0

I made sure mapper.py has all the permissions. The job errors out with:

Caused by: java.io.IOException: Cannot run program "mapper.py": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
    at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
... 19 more
Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
    at java.lang.ProcessImpl.start(ProcessImpl.java:91)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)

I also tried copying mapper.py to HDFS and passing the same hdfs://localhost/mapper.py link, but that does not work either. Any thoughts on how to fix this?

vkris

8 Answers


Looking at the example on the HadoopStreaming wiki page, it seems that you should change

-mapper scripts/mapper.py 
-file scripts/mapper.py 

to

-mapper mapper.py 
-file scripts/mapper.py 

since "shipped files go to the working directory". You might also need to specify the python interpreter directly:

-mapper "/path/to/python mapper.py"
-file scripts/mapper.py 
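
For reference, a minimal streaming mapper along these lines could look like the following. This is only a sketch: the field handling is hypothetical, and the shebang must point at a python that exists on your nodes.

#!/usr/bin/env python
# Minimal identity-style mapper for Hadoop streaming: read lines
# from stdin and emit tab-separated key/value pairs on stdout.
import sys

for line in sys.stdin:
    line = line.strip()
    if line:
        print '%s\t%s' % (line, 1)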
Bkkbrad
  • Thanks Brad, but the error changed to /System/Library/Frameworks/Python.framework/Versions/2.5/Resources/Python.app/Contents/MacOS/Python: can't open file 'mapper.py': [Errno 2] No such file or directory java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2 – vkris Dec 03 '10 at 14:00
  • I have a working script that has -file ../scripts/mapper.py -mapper ../scripts/mapper.py – Brig Dec 10 '10 at 21:02

Your problem is most likely that the python executable does not exist on the slaves (where the TaskTracker is running). Java will give the same error message in that case.

Install python everywhere it's needed. In your file you can use a shebang, as you probably already do:

#!/usr/bin/python -O
# ... rest of the code ...

Make sure that the path after the shebang matches where python is installed on the TaskTrackers.
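
One quick way to verify this is to put the same shebang into a trivial script, ship it to a TaskTracker node, and execute it directly; if the shebang path is wrong, running it fails with the same error=2. A sketch (check.py is a hypothetical file name):

#!/usr/bin/python -O
# Save as check.py, run chmod a+x check.py, then execute ./check.py
# on the node. A bad shebang path fails with "No such file or
# directory"; a good one prints the interpreter actually being used.
import sys
print sys.executable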

gphilip

One other sneaky thing can cause this. If your line-endings on the script are DOS-style, then your first line (the "shebang line") may look like this to the naked eye:

#!/usr/bin/python

...my code here...

but its bytes look like this to the kernel when it tries to execute your script:

% od -a myScript.py
0000000   #   !   /   u   s   r   /   b   i   n   /   p   y   t   h   o
0000020   n  cr  nl  cr  nl   .   .   .   m   y  sp   c   o   d   e  sp
0000040   h   e   r   e   .   .   .  cr  nl

It's looking for an executable called "/usr/bin/python\r", which it can't find, so it dies with "No such file or directory".

This bit me today, again, so I had to write it down somewhere on SO.
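
If dos2unix is not at hand, a couple of lines of python do the same repair. A rough sketch, assuming the script is small enough to read into memory (myScript.py is the file from the od example above):

# Rewrite myScript.py with Unix (LF) line endings, like dos2unix.
data = open('myScript.py', 'rb').read()
open('myScript.py', 'wb').write(data.replace('\r\n', '\n'))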

Ken Williams
  • Found the same thought here: http://stackoverflow.com/questions/20218521/hadoop-streaming-external-mapper-script-file-not-found – Jeevs May 12 '15 at 17:26
  • Got snagged by this one over the weekend. Thanks Obama! :D – dave Aug 17 '15 at 16:03

I ran into the exact same issue on a CDH4 Hadoop cluster trying to run a streaming python job. The trick is to add the following as the first lines of your mapper / reducer file:

import sys
sys.path.append('.')

This will make python look in the current working directory, and it should then be able to run. Also make sure that your shebang is correct.
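
Put together, the top of the mapper or reducer would look roughly like this (the shebang path is an assumption; adjust it to where python lives on your nodes):

#!/usr/bin/python
# Make python search the task's current working directory for
# modules shipped alongside the script.
import sys
sys.path.append('.')

# ... rest of the mapper / reducer code ...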

DrDee

I faced the same issue while running MapReduce with python code. The solution is that we have to specify "-file" in front of the mapper and the reducer as well.

Here is the command:

hadoop jar /opt/cloudera/parcels/CDH-5.12.2-1.cdh5.12.2.p0.4/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.12.2.jar \
    -file /home/mapper.py -mapper /home/mapper.py \
    -file /home/reducer.py -reducer /home/reducer.py \
    -input /system/mainstream/tmp/file.txt \
    -output /system/mainstream/tmp/output
Taegost

A "file not found" error sometimes does not mean "file not found"; instead it means "cannot execute this script".

Knowing this, I have solved problems like this one. When you are facing (non-java) issues with streaming, I suggest you follow this checklist:

  1. Does the script run? Don't start it with the interpreter, i.e. python myScript.py; make it executable and start it as ./myScript.py, because that is the way streaming will call your script.
  2. Use -verbose to see what goes into the jar that will be deployed into the container; sometimes this helps.
  3. Inside the containers, scripts are symlinks, not real files.
  4. Files shipped with -file are not in folders: -mapper folder/script.py and -reducer folder/script.py are treated as script.py.
  5. Containers and anything inside them are deleted after the job completes. If you want to see what is happening inside a container, copy its contents to HDFS, e.g. by replacing the mapper or the reducer with a .sh script that does the copying (a throwaway debugging mapper is sketched after this list).

This checklist helped me a lot; I hope it can be useful for you too.
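
For point 5, even a throwaway python mapper can show what actually landed in the container. A sketch only; instead of copying to HDFS it just dumps the working directory to stderr, which ends up in the task logs:

#!/usr/bin/env python
# Throwaway debugging mapper: list the container's working directory
# on stderr (visible in the task logs), then drain stdin so the task
# still finishes cleanly.
import os
import sys

sys.stderr.write('cwd: %s\n' % os.getcwd())
for name in os.listdir('.'):
    sys.stderr.write('entry: %s\n' % name)

for line in sys.stdin:
    pass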

Here follows the classic log with the ambiguous error message.

It's true, it cannot run the program.

Caused by: java.io.IOException: Cannot run program "/hadoop/yarn/local/usercache/root/appcache/application_1475243242823_0007/container_1475243242823_0007_01_000004/./reducer.py": 
error=2, No such file or directory

It's the reason that is the lie.

    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
    at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
    ... 15 more

Read this:

Caused by: java.io.IOException: error=2, No such file or directory

It's a lie; the file does exist if -verbose shows it in the packaging list.

    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:187)
    at java.lang.ProcessImpl.start(ProcessImpl.java:130)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
ozw1z5rd

Does your mapper.py have execute permission on it? If not, it needs it.

chmod a+x scripts/mapper.py

Hadoop forks and runs the script before it writes to / reads from stdin and stdout, so you need to give it execute permission to run.

Joe Stein
  • yeah it has. I mentioned in the post that it has all the permissions. – vkris Dec 04 '10 at 01:52
  • Maybe you should go to one of your task tracker nodes and try running cat somedata.csv | ./mapper.py; you might find an error from the data node with something anomalous. Also, is the scripts directory a sibling of bin and contrib? – Joe Stein Dec 04 '10 at 02:04
  • I am trying to run in pseudo-distributed mode. I did try running on an actual cluster; it still gives the same problem. Running cat inputfile | ./mapper.py works!! Yes, the scripts directory is a sibling of bin and contrib. – vkris Dec 06 '10 at 17:11

I just received the same error when my mapper returned a null or empty string, so I had to add a check for the value:

try:
    # Skip over any errors; words is the current input line
    # already split into fields
    word = words[18].strip()

    if len(word) == 0:
        word = "UNKNOWN"

    print '%s\t%s' % (word, 1)

except (IndexError, ValueError):
    pass
Brig
  • Ooh! I tried with my input data; it was working when I did cat input.txt | python mapper.py – vkris Dec 10 '10 at 18:25
  • My test data passes the cat | mapper.py | reducer.py test too. I also had to add error handling. – Brig Dec 10 '10 at 20:59