mrJob python mapReduce word_count.py

Question

I have just started using mrjob (MapReduce for Python) and am new to the MapReduce paradigm. I have a question about the word_count.py tutorial in the mrjob documentation.

The docs say that if we create a word_count.py and run it with some text file, it will calculate and return a count of the lines, chars and words in the text file. Here is the code they use for word_count.py:

from mrjob.job import MRJob


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()

I understand that we extend the MRJob class and override the mapper and reducer methods. What I don't get is that, when we run the job, we pass the entire text file on the command line:

python  word_count.py  entire_text_file.txt

So how does the mapper know to parse it one line at a time? In other words, what is the input to the mapper() function defined above: the contents of the entire file at once, or a single line at a time? And if it is a single line, which part of the mrjob code supplies one line at a time to mapper()? I hope this makes my originally vague question less vague, but it has me completely stumped. Any help would be appreciated.

Thanks in advance!

zhutoulala · Answer 1 · 2013-11-14T04:15:04.040

Well, I guess the best answer is RTFC :P

If you look into /usr/lib/python2.6/site-packages/mrjob/job.py (assuming you installed mrjob with pip on Python 2.6), you will see exactly how it reads lines from the input and runs the mapper on each line:

def run_mapper(self, step_num=0):
    ...

    # pick input and output protocol
    read_lines, write_line = self._wrap_protocols(step_num, 'mapper')

    if mapper_init:
        for out_key, out_value in mapper_init() or ():
            write_line(out_key, out_value)

    # run the mapper on each line
    for key, value in read_lines():
        for out_key, out_value in mapper(key, value) or ():
            write_line(out_key, out_value)

    if mapper_final:
        for out_key, out_value in mapper_final() or ():
            write_line(out_key, out_value)
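
To see what that loop means for the word_count.py job without installing mrjob, here is a minimal stand-in sketch. It is not mrjob's implementation: `read_lines`, `run_mapper` and `run_reducer` below are simplified, hypothetical versions written for illustration, and the sketch assumes the default raw-value input protocol, which hands the mapper a key of None (hence the `_` argument) and one line of text as the value.

```python
import io


def read_lines(stream):
    """Stand-in for mrjob's read_lines(): yield one (key, value) per line.

    Assumption: mimics the default raw-value input protocol, where the key
    is None and the value is the line without its trailing newline.
    """
    for line in stream:
        yield None, line.rstrip('\r\n')


def mapper(_, line):
    # Same body as MRWordFrequencyCount.mapper, minus `self`.
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1


def run_mapper(stream):
    """Call mapper() once per input line and collect everything it yields."""
    out = []
    for key, value in read_lines(stream):
        for out_key, out_value in mapper(key, value) or ():
            out.append((out_key, out_value))
    return out


def run_reducer(pairs):
    """Simplified reducer: sum the values for each key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


sample = io.StringIO("hello world\nfoo\n")
pairs = run_mapper(sample)
totals = run_reducer(pairs)
print(totals)  # {'chars': 14, 'words': 3, 'lines': 2}
```

So mapper() never sees the whole file: it is called once per line, and the framework's read loop is what splits the input.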

Here is the definition of read_lines:

def _wrap_protocols(self, step_num, step_type):
    """Pick the protocol classes to use for reading and writing
    for the given step, and wrap them so that bad input and output
    trigger a counter rather than an exception unless --strict-protocols
    is set.

    Returns a tuple of ``(read_lines, write_line)``

    ``read_lines()`` is a function that reads lines from input, decodes
        them, and yields key, value pairs.
    ``write_line()`` is a function that takes key and value as args,
        encodes them, and writes a line to output.

    :param step_num: which step to run (e.g. 0)
    :param step_type: ``'mapper'``, ``'reducer'``, or ``'combiner'`` from
                      :py:mod:`mrjob.step`
    """
    read, write = self.pick_protocols(step_num, step_type)

    def read_lines():
        for line in self._read_input():
            try:
                key, value = read(line.rstrip('\r\n'))
                yield key, value
            except Exception, e:
                if self.options.strict_protocols:
                    raise
                else:
                    self.increment_counter(
                        'Undecodable input', e.__class__.__name__)

    def write_line(key, value):
        try:
            print >> self.stdout, write(key, value)
        except Exception, e:
            if self.options.strict_protocols:
                raise
            else:
                self.increment_counter(
                    'Unencodable output', e.__class__.__name__)

    return read_lines, write_line
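
Note that the excerpt above is Python 2 code (`except Exception, e` and `print >>`). As a rough illustration of the same pattern in Python 3, here is a self-contained sketch of a reader that decodes each line and counts bad lines instead of crashing. The names `wrap_reader` and `tab_json_read` and the `counters` object are hypothetical stand-ins, not part of mrjob.

```python
import json
from collections import Counter


def tab_json_read(line):
    """Hypothetical protocol: decode 'json_key<TAB>json_value' into a pair."""
    key, value = line.split('\t', 1)
    return json.loads(key), json.loads(value)


def wrap_reader(lines, read, counters, strict=False):
    """Yield (key, value) per line; count undecodable lines unless strict."""
    for line in lines:
        try:
            key, value = read(line.rstrip('\r\n'))
        except Exception as e:
            if strict:
                raise
            # Stand-in for self.increment_counter(group, counter_name).
            counters[('Undecodable input', e.__class__.__name__)] += 1
        else:
            yield key, value


counters = Counter()
raw = ['"a"\t1\n', 'garbage\n', '"b"\t2\n']
decoded = list(wrap_reader(raw, tab_json_read, counters))
print(decoded)   # [('a', 1), ('b', 2)]
print(counters)  # one 'Undecodable input' / 'ValueError' entry
```

The design point is the same as in the library code: one malformed line bumps a counter and is skipped, so the rest of the input still reaches the mapper.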

Ultimately, you can read the read_input and read_file functions in /usr/lib/python2.6/site-packages/mrjob/util.py. Hopefully that helps.

Thanks for your response, it was very helpful. What does the 'K' mean in RTKC? :p I've heard RTFC and RTFM but never RTKC :p — anonuser0428, Nov 14 '13 at 04:03
