1

I'm new to MRJob and MapReduce, and I have a question about the traditional word count Python example for MRJob:

from mrjob.job import MRJob

class MRWordCounter(MRJob):
    def mapper(self, key, line):
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCounter.run()

Is it possible to store the (word, sum(occurrences)) tuples in a dictionary instead of yielding them, so I can access them later? What would the syntax be? Thanks!

Michael

2 Answers

2

You could simply return a list from the mapper instead of using yield:

from mrjob.job import MRJob

class MRWordCounter(MRJob):
    def mapper(self, key, line):
        results = []
        for word in line.split():
            results.append((word, 1))  # note: the list must hold (word, 1) tuples
        return results

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCounter.run()
WoooHaaaa
0

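Keep in mind that the job you've got may be run on another server, so a dictionary filled in inside the mapper or reducer would not be visible to the code that launched the job. Inputs and outputs are treated as problems to be managed by the script that runs your module.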

If you want to use the output of your job, you'll need to either read it from wherever you've written out to (it defaults to standard out) or run the job programmatically.
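For the first option, here is a minimal sketch of reading saved output back into a dictionary. It assumes the job used mrjob's default output format, where each line is a JSON-encoded key and a JSON-encoded value separated by a tab. The function name load_counts is made up for this illustration.

```python
import json


def load_counts(lines):
    """Parse mrjob-style output lines into a {word: count} dictionary.

    Assumes the default output format: a JSON-encoded key and a
    JSON-encoded value separated by a tab, one pair per line.
    """
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, value = line.split("\t", 1)
        counts[json.loads(key)] = json.loads(value)
    return counts


if __name__ == "__main__":
    # Two lines as they might appear in the job's output.
    sample = ['"hello"\t2\n', '"world"\t1\n']
    print(load_counts(sample))
```

If you redirected the job's standard output to a file, you could pass the open file object to load_counts, since iterating over a file yields its lines.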

It sounds like you want to run the job programmatically. In a separate module, you'll want to do something like:

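# Assumes MRWordCounter lives in mr_word_counter.py (hypothetical module name).
from mr_word_counter import MRWordCounter

mr_job = MRWordCounter(args=['-r', 'emr'])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        ... # do something with the parsed output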

Check out the docs for more details. The code sample above was taken from: http://pythonhosted.org/mrjob/guides/runners.html#runners-programmatically
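To get the dictionary the question asks for, the parsed pairs can be collected as they come back from the runner. The following is only a sketch: the helper names collect_counts and run_and_collect are made up, and it assumes a newer mrjob release (0.6 or later, as far as I know) where runner.cat_output() and job.parse_output() take the place of stream_output() and parse_output_line(). On an older release, the loop from the sample above would feed collect_counts instead.

```python
def collect_counts(pairs):
    """Store (word, count) pairs in a dictionary for later lookup."""
    counts = {}
    for word, total in pairs:
        counts[word] = total
    return counts


def run_and_collect(job_class, args):
    """Run an MRJob subclass and return its output as a dictionary.

    job_class is the job to run (for example MRWordCounter) and args is
    the argument list you would otherwise pass on the command line.
    Assumes the job exposes make_runner() and parse_output(), and the
    runner exposes run() and cat_output().
    """
    job = job_class(args=args)
    with job.make_runner() as runner:
        runner.run()
        # parse_output() turns the raw output bytes into (key, value) pairs.
        return collect_counts(job.parse_output(runner.cat_output()))
```

A call might look like run_and_collect(MRWordCounter, ['input.txt']), where input.txt is a placeholder input path. The returned dictionary lives in the calling process, so it can be used after the job has finished.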

thetainted1