I am new to Python programming, so excuse me in advance if I am asking something that is easily solved. I want to use MapReduce to process a CSV file that contains some values, and the output must be the maximum value. This is the script I've written so far:

from mrjob.job import MRJob

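class MRWordCounter(MRJob):

    def mapper(self, key, line):
        # Emit every comma-separated value under the same constant key
        for word in line.split(','):
            yield 'MAXIMUM VALUE IN FILE:', int(word)

    def reducer(self, word, occurrences):
        # All values share one key, so max() sees every number in the file
        yield word, max(occurrences)


if __name__ == '__main__':
    MRWordCounter.run()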

The script works fine: it maps and reduces to the maximum value and prints it as output. However, I think the way I implemented it, with yield 'MAXIMUM VALUE IN FILE:', is incorrect, since the mapper always sends that string to the reducer. Can someone confirm whether this is the wrong way to implement it and recommend how I can fix it?

Gyanendra Dwivedi
Kyr

1 Answer

Your approach is correct. As you noticed, the mapper always sends MAXIMUM VALUE IN FILE: to the reducer as the only key, so the key carries no information at this stage. Remember that the mapper only performs an intermediate step toward the final goal. This is not a standard, just my opinion, but in terms of readability the values the mapper emits are not yet the maximum value in the file, so they should not be labeled with the key MAXIMUM VALUE IN FILE:. Only the reducer knows which number is the maximum, so the reducer should be the one to label the final result.

In that case, you can emit None as the key from the mapper and have the reducer attach whatever label best describes the final result, here the maximum number.

I would suggest this approach instead (I took the liberty of renaming some variables to clarify what the code does):

from mrjob.job import MRJob


class MRFindMax(MRJob):

  def mapper(self, _, line):
    for number in line.split(','):
      yield None, int(number)

  # Discard key, because it is None
  # After sort and group, the output is only one key-value pair (None, <all_numbers>)
  def reducer(self, _, numbers):
    yield "Max value is", max(numbers)


if __name__ == '__main__':
  MRFindMax.run()
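
To see why a single None key is enough, here is a rough pure-Python sketch of the map, group, and reduce steps, runnable without mrjob or Hadoop. The helper run_local and the sample lines are made up for illustration and are not part of mrjob; the real framework does the grouping for you and may do it differently internally.

```python
from collections import defaultdict


def mapper(line):
    # Same logic as MRFindMax.mapper: every number gets the key None
    for number in line.split(','):
        yield None, int(number)


def reducer(_, numbers):
    # Same logic as MRFindMax.reducer: label the result here
    yield "Max value is", max(numbers)


def run_local(lines):
    # Hypothetical stand-in for the framework's shuffle step:
    # collect all mapped values under their key
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)

    # Every key is None, so there is exactly one group
    # and the reducer is called once with all the numbers
    results = []
    for key, values in grouped.items():
        results.extend(reducer(key, values))
    return results


if __name__ == '__main__':
    print(run_local(["3,17,4", "42,8", "15"]))
```

Because every pair shares the same key, the reducer is called once with all the numbers, which is exactly what max() needs.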

I hope this answer helps you write code that is not only correct, as yours already is, but that you also feel more comfortable with.

ekauffmann