Running MapReduce from Jupyter Notebook

Question

I am trying to run a MapReduce job from a Jupyter Notebook on the dataset in the u.data file, but I keep getting this error:

"TypeError: 'str' object doesn't support item deletion".

How can I make the code run successfully?

The u.data file contains rows like the following:

196 242 3   881250949
186 302 3   891717742
22  377 1   878887116
244 51  2   880606923
166 346 1   886397596
298 474 4   884182806
115 265 2   881171488
253 465 5   891628467
305 451 3   886324817
6   86  3   883603013

And here is the code:

from mrjob.job import MRJob

class MRRatingCounter(MRJob):
    def mapper(self, key, line):
        (userID, movieID, rating, timestamp) = line.split("\t")
        yield rating, 1

    def reducer(self, rating, occurrences):
        yield rating, sum(occurrences)

if __name__ == "main__":
    MRRatingCounter.run()

filepath = "u.data"

MRRatingCounter(filepath)

The same code runs successfully if I save it as a .py file and run it from the command line: !python ratingCounter.py u.data

score 0 · Answer 1 · answered Mar 17 '17 at 17:23

MRRatingCounter needs to live in its own .py file, let's say MRRatingCounter.py:

from mrjob.job import MRJob

class MRRatingCounter(MRJob):

    def mapper(self, key, line):
        (userID, movieID, rating, timestamp) = line.split("\t")
        yield rating, 1

    def reducer(self, rating, occurrences):
        yield rating, sum(occurrences)

if __name__ == "__main__":
    MRRatingCounter.run()

Import the class into your notebook and execute it through the runner:

from MRRatingCounter import MRRatingCounter

mr_job = MRRatingCounter(args=['u.data'])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        # handle each line however you like; on newer mrjob releases (0.6+),
        # iterate over mr_job.parse_output(runner.cat_output()) instead
        print(line)
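As for the error itself: the first parameter of the MRJob constructor is args, so MRRatingCounter(filepath) passes the bare string "u.data" where a list of arguments is expected. The sketch below reproduces the same message using only the standard library's optparse. It rests on an assumption, namely that the mrjob release in use parsed its arguments with optparse (as I believe releases before 0.6 did). The helper name parse is hypothetical, and mrjob itself is not involved.

```python
from optparse import OptionParser


def parse(args):
    """Parse `args` with optparse; return the positional arguments or the error text."""
    parser = OptionParser()
    try:
        # optparse copies `args` and deletes items from the copy as it consumes them.
        # That works for a list, but a str cannot have items deleted.
        _, positional = parser.parse_args(args)
        return positional
    except TypeError as exc:
        return "TypeError: {}".format(exc)


# A bare string, like MRRatingCounter("u.data") in the question.
print(parse("u.data"))

# A list of arguments, like MRRatingCounter(args=['u.data']) above.
print(parse(["u.data"]))
```

The first call should report the item-deletion TypeError from the question, and the second should return the file name as a positional argument, which is why wrapping the path in a list makes the difference.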

score 0 · Answer 2 · answered Jun 27 '19 at 16:33

As you mentioned, the important part is to have the code saved as a .py file. From a notebook you can do that by starting the cell with the %%file filename.py cell magic, which writes the rest of the cell to that file.

Here I use rc.py as the filename, and all the code goes into a single cell:

%%file rc.py
from mrjob.job import MRJob
class MRRatingCounter(MRJob):
    def mapper(self, key, line):
        (userId, movieId, rating, timestamp) = line.split('\t')
        yield rating, 1

    def reducer(self, rating, occurrences):
        yield rating, sum(occurrences)

if __name__ == '__main__':
    MRRatingCounter.run()

Once you run the cell, in the next cell you can run the following:

!python rc.py u.data

This will give you the output you're looking for.
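If you want something to compare that output against, here is a minimal plain-Python sketch of the same map and reduce steps, run on the ten sample rows from the question. It does not use mrjob at all; the names SAMPLE_ROWS and count_ratings are made up for illustration, and the rows are assumed to be tab-separated, as the split("\t") in the job implies.

```python
from collections import defaultdict

# The ten sample rows from the question, assumed tab-separated.
SAMPLE_ROWS = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
    "244\t51\t2\t880606923",
    "166\t346\t1\t886397596",
    "298\t474\t4\t884182806",
    "115\t265\t2\t881171488",
    "253\t465\t5\t891628467",
    "305\t451\t3\t886324817",
    "6\t86\t3\t883603013",
]


def mapper(line):
    """Same logic as MRRatingCounter.mapper: emit (rating, 1) for each row."""
    user_id, movie_id, rating, timestamp = line.split("\t")
    yield rating, 1


def count_ratings(lines):
    """Group mapper output by rating, then sum each group like the reducer does."""
    grouped = defaultdict(list)
    for line in lines:
        for rating, one in mapper(line):
            grouped[rating].append(one)
    return {rating: sum(ones) for rating, ones in sorted(grouped.items())}


print(count_ratings(SAMPLE_ROWS))
# {'1': 2, '2': 2, '3': 4, '4': 1, '5': 1}
```

On the full u.data file the counts will of course be larger, but the keys should be the same five rating values.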
