from mrjob.job import job


class KittyJob(MRJob):

    OUTPUT_PROTOCOL = JSONValueProtocol

    def mapper_cmd(self):
        return "grep kitty"

    def reducer(self, key, values):
        yield None, sum(1 for _ in values)


if __name__ == '__main__':
    KittyJob().run()

Source: https://mrjob.readthedocs.org/en/latest/guides/writing-mrjobs.html#protocols

How does this code do its task of counting the number of lines containing kitty?

Also where is OUTPUT_PROTOCOL defined?

Ankur Agarwal

1 Answer


Well, the short answer is that this example doesn't count lines containing 'kitty'.

Here is some code using filters that does count lines containing (case-insensitive) kitty:

from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol
from mrjob.step import MRStep

class KittyJob(MRJob):
    OUTPUT_PROTOCOL = JSONValueProtocol

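    def mapper(self, _, line):
        # Only lines that passed the grep pre-filter reach this mapper,
        # so every call stands for one matching line.
        yield 'kitty', 1

    def sum_kitties(self, key, values):
        # All pairs share the key 'kitty'; the sum is the line count.
        yield None, sum(values)

    def steps(self):
        return [
            # mapper_pre_filter pipes the input through a shell command
            # before the mapper sees it.
            MRStep(mapper_pre_filter='grep -i "kitty"',
                   mapper=self.mapper,
                   reducer=self.sum_kitties)]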

if __name__ == '__main__':
    KittyJob().run()

If I run it with the local runner, as noted in "Shell Commands as Steps", over the text of the English Wikipedia page for 'Kitty', I get a count of all lines containing 'kitty', as expected:

$ python grep_kitty.py -q -r local kitty.txt
20
$ grep -ci kitty kitty.txt
20
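To see why the two numbers agree, here is a rough plain-Python sketch of the data flow in the job above. It does not use mrjob at all: `pre_filter` is a hypothetical stand-in for the `grep -i "kitty"` pre-filter, `run` is an invented helper that wires the stages together, and the final `json.dumps` approximates what I understand JSONValueProtocol to do (write only the value, not the key). Treat it as an illustration under those assumptions, not as how mrjob is implemented.

```python
import json
import re


def pre_filter(lines):
    """Hypothetical stand-in for mapper_pre_filter='grep -i "kitty"'."""
    pattern = re.compile("kitty", re.IGNORECASE)
    return [line for line in lines if pattern.search(line)]


def mapper(line):
    # Mirrors KittyJob.mapper: one ('kitty', 1) pair per surviving line.
    yield "kitty", 1


def reducer(key, values):
    # Mirrors KittyJob.sum_kitties: drop the key and sum the ones.
    yield None, sum(values)


def run(lines):
    """Invented helper: filter -> map -> reduce -> value-only JSON output."""
    pairs = [pair for line in pre_filter(lines) for pair in mapper(line)]
    if not pairs:
        # With no mapper output the reducer is never called, so nothing is written.
        return []
    # Every pair has the same key, so a single reducer call handles all values.
    values = [value for _, value in pairs]
    # Assumption: the output protocol serializes only the value as JSON.
    return [json.dumps(value) for _, value in reducer("kitty", values)]


if __name__ == "__main__":
    sample = [
        "Hello Kitty",
        "a kitty and another kitty",  # two matches, but still one line
        "no cats here",
        "KITTY!",
    ]
    print("\n".join(run(sample)))  # prints 3
```

Like `grep -c`, this counts matching lines rather than occurrences, which is why a line with two matches still contributes 1.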

It looks like the example you cite from the mrjob docs is just wrong.

jeffmcc