
mapper.py is working fine. I ran mapper.py on my cluster and stored its output in part-0.txt.

Exactly like a word-count job, I am trying to count the occurrences of every distinct key stored in the part-0.txt file.

I tried copy-pasting the code from this link: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

It worked, but I was unable to understand its reducer code, so I wrote my own reducer.

And here is the reducer code:

#!/usr/bin/env python
from numpy import *
import sys

arr = []
previous_printed_word = ''
#f=open('/home/nalin/Downloads/part-0.txt','r')

for line in sys.stdin:
    line = line.strip()
    current_word, current_count = line.split('\t',1)
    current_count = 0

    if(previous_printed_word != current_word):
        #f2 = open('/home/nalin/Downloads/part-0.txt', 'r')
        for line2 in sys.stdin:
            line2 = line2.strip()
            word, count2 = line2.split('\t', 1)
            count2 = int(count2)
            if current_word == word:
                current_count = current_count + count2
            else:
                continue
        print '%s\t\t\t%d' % (current_word, current_count-1)
    arr.append([current_word, current_count - 1])
        previous_printed_word = current_word

arr = sorted(arr, key=lambda row: row[1])
#print arr
length = len(arr)
print "LENGTH OF 2-D ARRAY IS = ", length
for i in range(1,11):
    print arr[length-i]

I keep getting this error:

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

I looked up what this error means and found that it appears when the reducer script itself fails (exits with a non-zero code).

But if I uncomment these two lines:

f = open('/home/nalin/Downloads/part-0.txt', 'r')

f2 = open('/home/nalin/Downloads/part-0.txt', 'r')

and use f and f2 in place of sys.stdin (in both occurrences), then it works like a charm.

It works when I run it locally on the mapper's output file, but it does not work when I run it on the cluster.
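That difference makes sense given how Python treats the two inputs: every open() call returns a fresh file object positioned at the beginning, but sys.stdin is a one-shot stream, so once the inner for loop drains it the outer loop has nothing left to read. A small sketch (using a list iterator as a stand-in for sys.stdin) shows the effect:

```python
# Nesting two loops over the same iterator: the inner loop drains it,
# so the outer loop finishes after a single item.
stream = iter(["a\t1", "a\t1", "b\t1"])  # stand-in for sys.stdin

seen_by_outer_loop = []
for line in stream:
    seen_by_outer_loop.append(line)
    for _ in stream:  # consumes every remaining line
        pass

print(seen_by_outer_loop)  # only the first line ever reaches the outer loop
```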

Help me figure out what is wrong with the code.
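For comparison, the tutorial's reducer gets away with a single pass because Hadoop Streaming sorts the mapper output by key before the reducer sees it, so all lines for one word arrive together. A minimal sketch of that pattern, written here as a function over any iterable of tab-separated word/count lines so it can be tested outside Hadoop (the function name is mine, not from the tutorial):

```python
def sum_sorted_counts(lines):
    """Single pass over key-sorted 'word<TAB>count' lines,
    emitting (word, total) once per distinct word."""
    totals = []
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.strip().split('\t', 1)
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                totals.append((current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        totals.append((current_word, current_count))
    return totals

# In the actual streaming job the reducer would read the sorted
# mapper output from sys.stdin and print one line per word:
#   for word, total in sum_sorted_counts(sys.stdin):
#       print '%s\t%d' % (word, total)
```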

bhoots21304
  • How are you executing the job? – franklinsijo Feb 23 '17 at 06:09
  • using this command: bin/hadoop \ jar /home/nalin/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \ -mapper "python /home/nalin/PycharmProjects/ForHadoop/ml_1n_mapper.py" \ -reducer "python /home/nalin/PycharmProjects/ForHadoop/ml_1n_reducer.py" \ -input "/input4/ratings.dat" \ -output "wordcount12" – bhoots21304 Feb 23 '17 at 08:14

0 Answers