I have to join 6 sets of data pertaining to the number of views for certain TV shows on various channels. 3 of the 6 data sets contain a list of shows and the number of views of each (comma-separated), e.g.:
Show_Name,201
Another_Show,105
and so on...
The other 3 data sets contain the shows and the channel on which each one airs, e.g.:
Show_Name,ABC
Another_Show,CNN
and so on...
I wrote the following mapper in Python to find the view counts for shows aired on the ABC channel:
#!/usr/bin/env python
import sys

all_shows_views = []
shows_on_ABC = []

for line in sys.stdin:
    line = line.strip()              # strip out the carriage return (i.e. remove line breaks).
    key_value = line.split(",")      # split the line into key and value, returns a list.
    key_in = key_value[0]            # key is the show name (no date field, so no further split needed).
    value_in = key_value[1]          # value is the 2nd item.
    if value_in.isdigit():           # a numeric value means this is a views record.
        show = key_in
        all_shows_views.append(show + "\t" + value_in)
    if value_in == "ABC":            # a value of "ABC" means this show airs on ABC.
        show = key_in
        shows_on_ABC.append(show)

# Join the two buffered lists: emit a views record only for shows that air on ABC.
for i in range(len(all_shows_views)):
    show_view = all_shows_views[i].split("\t")
    for c in range(len(shows_on_ABC)):
        if show_view[0] == shows_on_ABC[c]:
            print (show_view[0] + "\t" + show_view[1])

# Note that Hadoop expects a tab to separate key and value,
# but this program assumes the input file has a ',' separating key and value.
The mapper emits only the name of each show aired on ABC and its number of views, e.g.:
Show_name_on_ABC 120
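For a quick sanity check outside Hadoop, the mapper can be fed a few lines on stdin. A minimal sketch, assuming Python 3.7+ (for subprocess.run with capture_output) and that the script lives at /home/cloudera/mapper.py; the sample lines are invented to match the comma-separated format above:

#!/usr/bin/env python
import subprocess

# Two views records and two channel records, comma-separated as the mapper expects.
sample = "Show_Name,201\nAnother_Show,105\nShow_Name,ABC\nAnother_Show,CNN\n"

result = subprocess.run(["python", "/home/cloudera/mapper.py"],
                        input=sample, capture_output=True, text=True)
print(result.stdout, end="")  # expected: Show_Name<tab>201 (Another_Show airs on CNN, so it is dropped)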
The reducer, also in Python, is as follows:
#!/usr/bin/env python
import sys

prev_show = " "   # initialize previous show to a blank string
line_cnt = 0      # count input lines.
count = 0         # keep a running total.

for line in sys.stdin:
    line = line.strip()              # strip out the carriage return
    key_value = line.split('\t')     # split the line into key and value, returns a list
    line_cnt = line_cnt + 1
    curr_show = key_value[0]         # key is the first item in the list, indexed by 0
    value_in = key_value[1]          # value is the 2nd item
    if curr_show != prev_show and line_cnt > 1:
        print (prev_show + "\t" + str(count))   # flush the total for the previous show
        count = 0
    count = count + int(value_in)    # add unconditionally so the first line of a new show is counted
    prev_show = curr_show            # set up the previous show for the next set of input lines.
print (curr_show + "\t" + str(count))   # flush the final show's total
The reducer takes the list of shows on ABC with their view counts, keeps a running total for each show, and prints each show's total (Hadoop automatically sorts the data alphabetically by key, which here is the show name, before it reaches the reducer).
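For intuition only, the reducer's group-and-sum over sorted show<tab>views lines is equivalent to this itertools.groupby sketch (an illustration of the sorted-input contract, not part of the job):

#!/usr/bin/env python
import sys
from itertools import groupby

# Assumes stdin is already sorted by show name, as Hadoop's shuffle guarantees.
pairs = (line.strip().split("\t") for line in sys.stdin)
for show, group in groupby(pairs, key=lambda kv: kv[0]):
    print(show + "\t" + str(sum(int(views) for _, views in group)))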
When I run this with a pipeline in the terminal, as follows:
cat Data*.text | /home/cloudera/mapper.py | sort | /home/cloudera/reducer.py
I get neat output with the correct totals, as follows:
Almost_Games 49237
Almost_News 45589
Almost_Show 49186
Baked_Games 50603
When I run this job with the Hadoop streaming command in the terminal, as follows:
> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/input \
-output /user/cloudera/output_join \
-mapper /home/cloudera/mapper.py \
-reducer /home/cloudera/reducer.py
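For reference, Hadoop Streaming can also ship the scripts to the task nodes with its -file option (both scripts also need a shebang line and execute permission); the same command under those assumptions would look like:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/input \
-output /user/cloudera/output_join \
-mapper mapper.py \
-reducer reducer.py \
-file /home/cloudera/mapper.py \
-file /home/cloudera/reducer.py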
With the command above, the job fails, with the reducer as the culprit. The full error output is as follows:
15/11/15 09:16:54 INFO mapreduce.Job: Job job_1447598349691_0003 failed with state FAILED due to: Task failed task_1447598349691_0003_r_000000
Job failed as tasks failed. failedMaps:0 failedReduces:1
15/11/15 09:16:54 INFO mapreduce.Job: Counters: 37
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=674742
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=113784
HDFS: Number of bytes written=0
HDFS: Number of read operations=18
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed reduce tasks=4
Launched map tasks=6
Launched reduce tasks=4
Data-local map tasks=6
Total time spent by all maps in occupied slots (ms)=53496
Total time spent by all reduces in occupied slots (ms)=18565
Total time spent by all map tasks (ms)=53496
Total time spent by all reduce tasks (ms)=18565
Total vcore-seconds taken by all map tasks=53496
Total vcore-seconds taken by all reduce tasks=18565
Total megabyte-seconds taken by all map tasks=54779904
Total megabyte-seconds taken by all reduce tasks=19010560
Map-Reduce Framework
Map input records=6600
Map output records=0
Map output bytes=0
Map output materialized bytes=36
Input split bytes=729
Combine input records=0
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=452
CPU time spent (ms)=4470
Physical memory (bytes) snapshot=1628909568
Virtual memory (bytes) snapshot=9392836608
Total committed heap usage (bytes)=1279262720
File Input Format Counters
Bytes Read=113055
15/11/15 09:16:54 ERROR streaming.StreamJob: Job not successful!
Streaming Command Failed!
Why does the pipeline work while the Hadoop execution fails?