I have just started using mrjob (MapReduce for Python) and am new to the MapReduce paradigm. I have a question about the word_count.py tutorial on the mrjob documentation site.
The docs say that if we create word_count.py and run it on a text file, it will calculate and return the counts of lines, characters, and words in that file. Here is the code they use for word_count.py:
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()
I understand that we extend the MRJob class and override the mapper and reducer methods. What I don't get is how the input is handled during execution, since we run the job by passing the entire text file:
python word_count.py entire_text_file.txt
So how does the mapper know to parse it one line at a time? Basically, my question is: what will the input to the mapper() function defined above be? Will it be the contents of the entire file at once, or a single line at a time? And if it is a single line, what part of the mrjob code takes care of supplying one line at a time to the mapper() function? I hope I have made my originally vague question less vague, but this has me completely stumped. Any help would be appreciated.
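To make my question concrete, here is the behaviour I am guessing at, with the mapper logic copied out as a standalone function (this is just my own sketch to show what I mean, not something from the docs):

```python
def mapper(_, line):
    # Same logic as MRWordFrequencyCount.mapper, copied standalone.
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1

# If mrjob really feeds one line at a time, then a single call
# would only ever see one line of the file:
print(list(mapper(None, "hello world")))
# prints [('chars', 11), ('words', 2), ('lines', 1)]
```

Is that single-line-per-call picture correct, or does the framework hand the mapper something else entirely?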
Thanks in advance!