MapReduce job to yield top 10 values using Python's MRjob

Question

I want this MapReduce job (code below) to output the top 10 most-rated products. It keeps failing with the following error message:

it = izip(iterable, count(0,-1)) # decorate
TypeError: izip argument #1 must support iteration

I think it has to do with the nlargest function I am trying to apply.

Any pointers?

Thank you!

from mrjob.job import MRJob
from mrjob.step import MRStep
from heapq import nlargest


class MostRatedProduct(MRJob):

    def steps(self):
        return [
            MRStep(mapper = self.mapper_get_ratings,
                   reducer = self.reducer_count_ratings),
            MRStep(reducer = self.reducer_find_top10)
        ]


    def mapper_get_ratings(self, _, line):
        (userID, itemID, rating, timestamp) = line.split(',')
        yield itemID, 1

    def reducer_count_ratings(self, itemID, ratingCount):
        yield None, (sum(ratingCount), itemID)

    def top_10(self, ratingPair):
        for ratingTotal, itemID in ratingPair:
            top_rated = nlargest(10, ratingTotal)
        for top_rated in ratingTotal:
            return (ratingTotal, itemID)

    def reducer_find_top10(self, key, ratingPair):
        ratingTotal, itemID = self.top_10(ratingPair)
        yield ratingTotal, itemID


if __name__ == '__main__':
    MostRatedProduct.run()

score 2 · Answer 1 · answered Aug 08 '19 at 14:47

With the mrjob library you can do this in Python. The job below counts words and prints the top 5; the same two-step pattern applies to rated products:

# Print the top 5 word occurrences

# Import dependencies
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRWordCount(MRJob):

  def steps(self):
    return [MRStep(mapper=self.mapper, reducer=self.reducer),
            MRStep(reducer=self.secondreducer)]

  def mapper(self, _, line):
    # Emit (word, 1) for every word on the input line
    for word in line.split():
      yield word.lower(), 1

  def reducer(self, key, values):
    # Send every (count, word) pair to a single reducer under the key None.
    # Zero-padding keeps string order equal to numeric order up to 9999.
    yield None, ('%04d' % int(sum(values)), key)

  def secondreducer(self, key, values):
    remaining = list(values)
    top = []
    # min() guards against inputs with fewer than 5 distinct words
    for _ in range(min(5, len(remaining))):
      largest = max(remaining)
      top.append(largest)
      remaining.remove(largest)
    for count, word in top:
      yield count, word

if __name__ == '__main__':
    MRWordCount.run()
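
As for the error in the question: nlargest expects an iterable, but top_10 hands it ratingTotal, which is a single integer, hence "must support iteration". Here is a minimal sketch of the selection step on its own, using only the standard library. The helper name top_n_pairs and the sample data are made up for illustration; the idea is to pass the whole stream of (ratingTotal, itemID) pairs to nlargest in one call.

```python
from heapq import nlargest

def top_n_pairs(rating_pairs, n=10):
    """Return the n largest (rating_total, item_id) pairs, biggest first."""
    # nlargest needs the whole iterable of pairs, not one integer total.
    # Pairs compare by rating_total first, then by item_id on ties.
    return nlargest(n, (tuple(pair) for pair in rating_pairs))

# Hypothetical sample of what the second reducer might receive
sample = [[3, "B001"], [7, "B002"], [1, "B003"], [5, "B004"]]

for rating_total, item_id in top_n_pairs(sample, n=2):
    print(rating_total, item_id)
```

Inside reducer_find_top10 you would then loop over top_n_pairs(ratingPair) and yield each ratingTotal, itemID pair, instead of returning a single tuple from a helper.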

score 1 · Answer 2 · answered Nov 29 '16 at 16:52

I haven't used mrjob, but I have used MapReduce on an AWS cluster to find top values before. Here is my code, which doesn't use heapq. Hopefully you can apply the same concept to your code. Here is the mapper:

import sys

def Parser():
    # Yield each input line split into [word, count] fields
    for line in sys.stdin:
        line = line.strip('\n')
        yield line.split()


def mapper():
    counts = list(Parser())
    # Keep the 10 rows with the largest count (second field)
    z = sorted(counts, key = lambda x: int(x[1]))[-10:]
    print('\n'.join('\t'.join(x) for x in z))


if __name__=='__main__':
    mapper()

Here is the code for the reducer

import sys

def Parser():
    # Yield each tab-separated input line as a (word, count) tuple
    for line in sys.stdin:
        yield tuple(line.strip('\n').split('\t'))

def reducer():
    # Merge every mapper's local top 10 and keep the overall top 10
    counts = list(Parser())
    z = sorted(counts, key = lambda x: int(x[1]))[-10:]
    print('\n'.join('\t'.join(x) for x in z))

if __name__=='__main__':
    reducer()

I changed it to output the top 10 words. Keep in mind this is a word count example where I parsed a text document. I hope this helps in some way!
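
To make the concept concrete, here is a small local simulation of the two stages, with no Hadoop involved. The helper top_n and the two input splits are invented for illustration, and the sketch assumes each word appears in only one split with its count already totalled; otherwise a local top N could discard a word that matters globally.

```python
def top_n(rows, n=10):
    """Keep the n rows with the largest count; rows are (word, count) pairs."""
    return sorted(rows, key=lambda row: int(row[1]))[-n:]

# Hypothetical input splits, one per mapper; counts are strings, as on stdin
split_a = [("apple", "4"), ("pear", "1"), ("fig", "9")]
split_b = [("kiwi", "7"), ("plum", "2"), ("lime", "5")]

# Mapper stage: each mapper emits only its local top 2
local_tops = top_n(split_a, n=2) + top_n(split_b, n=2)

# Reducer stage: the overall top 2 is chosen from the local winners
global_top = top_n(local_tops, n=2)
print(global_top)  # [('kiwi', '7'), ('fig', '9')]
```

The result is in ascending order of count, the same as the output of the scripts above.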
