
I am beginning to learn MapReduce with the mrjob Python package. The mrjob documentation lists the following snippet as an example MapReduce script.

"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))


if __name__ == '__main__':
    MRWordFreqCount.run()

I understand how this algorithm generally works, what the combiner (which is not required to run) does, and how reducers run on shuffled and sorted values from the mappers and combiners.

However, I do not understand how the reducers come up with a single value. Aren't there different reduce processes running on different nodes of a cluster? How do these reduce functions come up with a single answer if only certain shuffled key-value pairs are sent to certain reducers by the partitioners?

I guess I'm confused about how the outputs from the various reducers are combined into a single answer.

dangerChihuahua007

2 Answers


Basically, all the values which have the same key go to a single reducer. So even if there are multiple reducers, each reducer has all the data required for one single key.
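That guarantee comes from the partitioner, which deterministically maps each key to one reducer, typically by hashing. A minimal sketch of that idea in plain Python (a simulation of the shuffle phase, not Hadoop's or mrjob's actual internals; `partition` is an illustrative name):

```python
import zlib

def partition(key, num_reducers):
    # Deterministic hash partitioner: the same key always hashes to the
    # same reducer index, so all values for one key land together.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Simulated mapper output: (word, 1) pairs from several mappers.
mapper_output = [("the", 1), ("cat", 1), ("the", 1), ("sat", 1), ("the", 1)]

num_reducers = 3
buckets = {i: [] for i in range(num_reducers)}
for key, value in mapper_output:
    buckets[partition(key, num_reducers)].append((key, value))

# Every ("the", 1) pair is in the same bucket, so that one reducer
# sees all of them and alone computes the complete count for "the".
```

Because the mapping is a pure function of the key, no coordination between mappers is needed: they can run on different nodes and still route matching keys to the same reducer.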

Read Q

The short answer is they don't. As you correctly notice, all results would have to be sent to a single reducer, to get a single result.

You should generally expect to do some post-processing of the output of your map-reduce job. The job does the heavy crunching, but each reducer outputs individual results.

You would normally do that processing in a different environment, but more often than not I simply end up adding an extra job (taking the output of the first job as input) with an identity mapper (one that does no processing of the data) that emits everything under a single key, so all values are routed to a single reducer. That reducer can then do the final aggregation of the results. This may not always be an efficient or fast way to get an aggregated result, but sometimes the overhead is small enough that it's easier to just keep everything in one MRJob class.
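The pattern can be sketched in plain Python (a simulation of the two chained jobs, not mrjob itself; the function names are illustrative):

```python
from collections import defaultdict

def word_count_job(lines):
    # First job: the classic word count. In reality many reducers
    # each emit (word, count) pairs for their share of the keys.
    counts = defaultdict(int)
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1
    return list(counts.items())

def aggregate_job(pairs):
    # Second job: an identity mapper emits every pair under one shared
    # key (None here), so the shuffle sends everything to one reducer.
    shuffled = defaultdict(list)
    for word, count in pairs:
        shuffled[None].append((word, count))  # single key -> single reducer
    # The lone reducer now sees all pairs and can aggregate globally,
    # e.g. sum the per-word counts into one total.
    (_, values), = shuffled.items()
    return sum(count for _, count in values)

lines = ["the cat sat", "the cat"]
total = aggregate_job(word_count_job(lines))  # -> 5 words in total
```

In mrjob this would be expressed as a multi-step job (overriding `steps()` with two `MRStep`s), with the second step's reducer producing the single final answer.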

jkgeyti