Pipe command works but Mapreduce does not

Question

I have to join 6 sets of data pertaining to the amount of views for certain TV shows on various channels. 3 of the 6 sets of data contain a list of shows and the amount of views of each, e.g:

Show_Name 201

Another_Show 105

and so on...

The other 3 sets of data contain the shows and the channels on which each one is aired, e.g:

Show_Name ABC

Another_Show CNN

and so on...

I wrote the following Mapper in python to find on ABC channel:

#!/usr/bin/env python
import sys

all_shows_views = []
shows_on_ABC = []

for line in sys.stdin:
    line       = line.strip()   #strip out carriage return (i.e. removes line breaks).
    key_value  = line.split(",")   #split line into key and value, returns a list.
    key_in     = key_value[0]     #.split(" ") - Dont need the split(" ") b/c there is no date. 
    value_in   = key_value[1]     #value is 2nd item. 

    if value_in.isdigit():
        show = key_in
    all_shows_views.append(show + "\t" + value_in)
    if value_in == "ABC":            #check if the TV Show is ABC.       
    show = key_in
           shows_on_ABC.append(show)

for i in range(len(all_shows_views)):
    show_view = all_shows_views[i].split("\t")
    for c in range(len(shows_on_ABC)):
        if show_view[0] == shows_on_ABC[c]:
            print (show_view[0] + "\t" + show_view[1])

#Note that Hadoop expects a tab to separate key value
#but this program assumes the input file has a ',' separating key value.

The Mapper passes only the name of the show on ABC and the amount of views, e.g:

Show_name_on_ABC 120

The reducer, also in python, is as follows:

prev_show          = "  "    #initialize previous word  to blank string
line_cnt           = 0      #count input lines.
count            = 0        #keep running total.

for line in sys.stdin:
    line       = line.strip()           #strip out carriage return
    key_value  = line.split('\t')       #split line, into key and value, returns a list
    line_cnt   = line_cnt+1   
    curr_show  = key_value[0]             #key is first item in list, indexed by 0
    value_in   = key_value[1]             #value is 2nd item

    if curr_show != prev_show and line_cnt>1:
    #print "\n"
    #print "---------------------Total---------------------"
    #print "\n"
    print (prev_show + "\t" + str(count))
    #print "\n"
    #print "------------------End of Item------------------"
    #print "\n"
    count = 0
    else:
    count = count + int(key_value[1])
        #print key_value[0] + "\t" + key_value[1]

    prev_show = curr_show  #set up previous show for the next set of input lines.

print (curr_show + "\t" + str(count))

The reducer takes the list of shows on ABC with the number of views and keeps a counting average of each and prints out the total of each show (hadoop automatically orders the data in alphabetical order according to the key which is the name of the show in this case).

When I run this in the using the piping command in the Terminal as follwos:

cat Data*.text | /home/cloudera/mapper.py |sort| /home/coudera/reducer.py

I get a neat output with the correct totals as follows:

Almost_Games 49237

Almost_News 45589

Almost_Show 49186

Baked_Games 50603

When I run this problem in using the Hadoop command in the Terminal using the following command:

> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
   -input /user/cloudera/input \
   -output /user/cloudera/output_join \   
   -mapper /home/cloudera/mapper.py \   
   -reducer /home/cloudera/reducer.py

I get an unsuccessful error with the reducer being the culprit. The full error is as follows:

15/11/15 09:16:54 INFO mapreduce.Job: Job job_1447598349691_0003 failed with state FAILED due to: Task failed task_1447598349691_0003_r_000000
Job failed as tasks failed. failedMaps:0 failedReduces:1

15/11/15 09:16:54 INFO mapreduce.Job: Counters: 37
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=674742
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=113784
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=18
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters 
        Failed reduce tasks=4
        Launched map tasks=6
        Launched reduce tasks=4
        Data-local map tasks=6
        Total time spent by all maps in occupied slots (ms)=53496
        Total time spent by all reduces in occupied slots (ms)=18565
        Total time spent by all map tasks (ms)=53496
        Total time spent by all reduce tasks (ms)=18565
        Total vcore-seconds taken by all map tasks=53496
        Total vcore-seconds taken by all reduce tasks=18565
        Total megabyte-seconds taken by all map tasks=54779904
        Total megabyte-seconds taken by all reduce tasks=19010560
    Map-Reduce Framework
        Map input records=6600
        Map output records=0
        Map output bytes=0
        Map output materialized bytes=36
        Input split bytes=729
        Combine input records=0
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=452
        CPU time spent (ms)=4470
        Physical memory (bytes) snapshot=1628909568
        Virtual memory (bytes) snapshot=9392836608
        Total committed heap usage (bytes)=1279262720
    File Input Format Counters 
        Bytes Read=113055
15/11/15 09:16:54 ERROR streaming.StreamJob: Job not successful!
Streaming Command Failed!

Why would the piping command work and not the hadoop execution?

You'll need to find the log for the specific attempt(s) that failed. See http://stackoverflow.com/questions/3207238/where-does-hadoop-mapreduce-framework-send-my-system-out-print-statements-s. — Ben Watson, Nov 15 '15 at 19:59
I went to the log and found this error details: '2015-11-18 11:00:39,934 INFO [main] org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed! java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1:' any ideas? — Nicholas, Nov 18 '15 at 19:23
I did some further digging, and the error was pointing to the last print line of the reducer saying the variable "curr_show" is not defined (which it is, according to the piping command). When this line is removed, nothing there are no errors, but then nothing written to the output file? — Nicholas, Nov 18 '15 at 20:09
I don't know Python, but it looks like `curr_show` is only defined within the scope of the `for` loop, whereas the `print` statement is outside of it? — Ben Watson, Nov 18 '15 at 20:12
I am also relatively new to Python, I think you are correct in assuming the variable is only valid with in the `for` loop. I declared it outside of the loop and then the file executed with no errors. But the file it is writing to is empty? — Nicholas, Nov 18 '15 at 20:56

score 0 · Answer 1 · answered Nov 18 '15 at 09:50

It looks like you are not using the hadoop streaming command properly. Instead of

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
   -input /user/cloudera/input \
   -output /user/cloudera/output_join \   
   -mapper /home/cloudera/mapper.py \   
   -reducer /home/cloudera/reducer.py

in -mapper, you need to provide mapper command. Try

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
   -input /user/cloudera/input \
   -output /user/cloudera/output_join \  
   -mapper "python mapper.py" \   
   -reducer "python reducer.py" \
   -file /home/cloudera/mapper.py \   
   -file /home/cloudera/reducer.py

Also check the error log by opening any failed task at the tracking url, as above log is not much helpful.

I used the same format for a previous job and it worked just fine? But I will try the suggested method and report back. Thanks! — Nicholas, Nov 18 '15 at 19:27

score 0 · Answer 2 · answered Nov 19 '15 at 11:43

The Reducers Python Script is generating errors because the variable curr_show was only declared with in the line reading for loop. The reason for the error only occurring when using the Hadoop command and not for the piping command is because of scooping (with which I am very unfamiliar).

By declaring the curr_show variable outside the for loop, the final print command was able to execute.

prev_show          = "  "    #initialize previous word  to blank string
line_cnt           = 0      #count input lines.
count              = 0        #keep running total.
curr_show          = "  "

for line in sys.stdin:
    line       = line.strip()           #strip out carriage return
    key_value  = line.split('\t')       #split line, into key and value, returns a list
    line_cnt   = line_cnt+1   
    curr_show  = key_value[0]             #key is first item in list, indexed by 0
    value_in   = key_value[1]             #value is 2nd item

    if curr_show != prev_show and line_cnt>1:
        #print "\n"
        #print "---------------------Total---------------------"
        #print "\n"
        print (prev_show + "\t" + str(count))
        #print "\n"
        #print "------------------End of Item------------------"
        #print "\n"
        count = int(value_in)
    else:
        count = count + int(key_value[1])
        #print key_value[0] + "\t" + key_value[1]

    prev_show = curr_show  #set up previous show for the next set of input lines.

print (curr_show + "\t" + str(count))

Also, the count variable was changed to reset to the current value_in so that the current value at the time of the change in show is not lost.

score -1 · Answer 3 · edited Dec 14 '15 at 07:23

This mapper and reducer still doesn't work. I'm getting the below exception. Does any of you have found the issue?

The command used for this is:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar  -input     /user/cloudera/input_join -output /user/cloudera/output_join2 -mapper '/home/cloudera/join2.mapper.py' -reducer '/home/cloudera/join2.reducer.py'

Error Logs:

FATAL [IPC Server handler 5 on 51645] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1449644802746_0003_m_000001_0 - exited : java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322) at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535) at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Pipe command works but Mapreduce does not

3 Answers3