
I tried to run a Python wordcount on Hadoop 2.7.1, installed on Ubuntu 15.10, and I got an error:

log4j:WARN No appenders could be found for logger (org.apache.hadoop.ipc.Server).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

I also get a RuntimeException in the terminal, a message that the streaming job failed, and no output file is produced.

I found a few threads suggesting that log4j.properties and log4j.xml are probably missing, along with examples of what log4j.properties should contain. I tried one of the examples, but with no success. Where do I find these files in the Hadoop directory (if they exist at all), or how can I create them with the right configuration?

The mapper and reducer code for the wordcount is taken from here, and it runs absolutely fine locally with

cat input.txt | ./mapper.py | sort | ./reducer.py
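
For reference, the two scripts follow the usual Hadoop streaming wordcount pattern; a minimal sketch of that pattern (the linked code may differ in details) looks like this:

#!/usr/bin/env python
# mapper.py -- read lines from stdin, emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))

#!/usr/bin/env python
# reducer.py -- input arrives sorted by key, so counts for a word are contiguous
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, _, count = line.strip().partition('\t')
    try:
        count = int(count)
    except ValueError:
        continue  # skip malformed lines
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word = word
        current_count = count

if current_word is not None:
    print('%s\t%s' % (current_word, current_count))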

However, I have tried several times to run it on Hadoop and it fails. I used different commands, trying both with the Python files copied to HDFS and with them on the local file system. This one did not work:

hadoop jar hadoop-streaming-2.7.1.jar -mapper /user/mapper.py -reducer /user/reducer.py -input /input_file.txt -output /user/output

nor this one:

hadoop hadoop-streaming-2.7.1.jar -mapper "python /user/mapper.py" -reducer "python /user/reducer.py" -input/input_file.txt -output /user/output

This one did work (Python files on the local file system):

hadoop hadoop-streaming-2.7.1.jar -mapper "python /home/user_name/Documents/mapper.py" -reducer "python /home/user_name/Documents/reducer.py -input /user/input_file.txt -output /user/output

All the files have the right permissions.

The output - after the standard beginning - is as follows:

16/02/15 09:47:48 INFO mapreduce.Job:  map 0% reduce 0%
16/02/15 09:48:05 INFO mapreduce.Job: Task Id : attempt_1455529218252_0001_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:112)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:78)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:449)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:112)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:78)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
... 17 more
Caused by: java.lang.RuntimeException: configuration exception
    at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222)
    at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
... 22 more
Caused by: java.io.IOException: Cannot run program "/user/mr/mapper.py": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209)
... 23 more
Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:248)
    at java.lang.ProcessImpl.start(ProcessImpl.java:134)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 24 more

There is a lot more output, but it ends with the streaming job failing:

16/02/15 09:49:07 INFO mapreduce.Job: Counters: 13
    Job Counters 
        Failed map tasks=7
        Killed map tasks=1
        Launched map tasks=8
        Other local map tasks=6
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=135543
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=135543
        Total vcore-seconds taken by all map tasks=135543
        Total megabyte-seconds taken by all map tasks=138796032
    Map-Reduce Framework
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
16/02/15 09:49:07 ERROR streaming.StreamJob: Job not successful!
Streaming Command Failed!

What could be the reason the Python code does not work when invoked from HDFS?

piterd
  • These are not errors, just warnings and they do not affect your job. Please update your question with your RunTimeException error. – Mobin Ranjbar Feb 15 '16 at 06:27

1 Answer


You should just supply the names of the local Python files as arguments to -mapper and -reducer. They don't need to be on HDFS, nor should you supply a string with the command line to execute the scripts.

You also need to supply a -file argument for each script. Try using

hadoop jar hadoop-streaming-2.7.1.jar -file /home/user_name/Documents/mapper.py -file /home/user_name/Documents/reducer.py -mapper /home/user_name/Documents/mapper.py -reducer /home/user_name/Documents/reducer.py -input /input_file.txt -output /user/output
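
A variant along the same lines: assuming the scripts start with a #!/usr/bin/env python shebang and are executable (chmod +x mapper.py reducer.py), you can refer to them by basename once -file has shipped them into each task's working directory, for example:

hadoop jar hadoop-streaming-2.7.1.jar \
    -file /home/user_name/Documents/mapper.py \
    -file /home/user_name/Documents/reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /input_file.txt \
    -output /user/output

The -file option copies each script to the nodes running the tasks, which is why the basename is enough there, whereas a bare HDFS path like /user/mapper.py is not something the task can execute directly.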
semiserious
  • Does it make no difference, in terms of execution time, where the scripts are stored? Is it going to be the same whether I run them from the local file system or from HDFS? What about cloud platforms like EMR or HDInsight, is it going to be the same? – piterd Feb 15 '16 at 13:28
  • You can't "run" the scripts from HDFS. HDFS is used to store input and output data, not mapper/reducer code - the execution of code is handled by the Hadoop application runtime, which interfaces with HDFS to read input data into the pipeline and writes the output data to the specified location. MapReduce performance, execution time, etc. is dependent on things like cluster size and configuration, and on the underlying hardware of the node(s) doing the processing. – semiserious Feb 15 '16 at 13:46
  • I tried to do it the wrong way then. Thanks for the answer. – piterd Feb 15 '16 at 13:48