I'm trying to write a MapReduce program that computes trigrams using the mrjob framework in Python. This is what I have so far:
from mrjob.job import MRJob

class MRTrigram(MRJob):
    def mapper(self, _, line):
        w = line.split()
        for idx, word in enumerate(w):
            if idx < len(w) - 2:
                # Generate a trigram using the current word and next 2 words
                trigram = w[idx] + " " + w[idx + 1] + " " + w[idx + 2]
                yield trigram, 1

    def reducer(self, key, values):
        yield sum(values), key

# Ignore this part - it's just standard boilerplate for mrjob!
if __name__ == '__main__':
    MRTrigram.run()
As you can see, I haven't handled the case where a trigram is split across lines. Say "it was" is at the end of line 3 and "the best of times" is at the beginning of line 4: my code would not capture the trigram "it was the" in this case!
How do I preserve state across multiple map calls so that, however the underlying runtime assigns lines to the mappers, trigrams are counted only across consecutive lines? I thought of storing the last 2 words of each line in a persistent data structure inside the MRTrigram class, but then I realized I could not guarantee that I was comparing words across lines i and i+1 (and not lines i and j, where j could be any line in the document!).
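To make the idea concrete, here is a minimal sketch in plain Python (not mrjob) of the carry-over approach I had in mind. The names `count_trigrams` and `carry` are made up for illustration, and it only works under the assumption that all lines are seen in document order within a single process, which is exactly what I can't guarantee once the input is split across mappers.

```python
from collections import Counter


def count_trigrams(lines):
    """Count trigrams, including those that span a line break.

    Assumption: `lines` is iterated in document order in one process.
    """
    counts = Counter()
    carry = []  # the last (up to) two words seen so far
    for line in lines:
        # Prepend the carried-over words so boundary trigrams are formed.
        words = carry + line.split()
        for i in range(len(words) - 2):
            counts[" ".join(words[i:i + 3])] += 1
        # Keep only two words, so earlier trigrams are never recounted.
        carry = words[-2:]
    return counts


if __name__ == "__main__":
    sample = ["it was", "the best of times"]
    for trigram, n in sorted(count_trigrams(sample).items()):
        print(n, trigram)
```

On the two-line sample this does pick up "it was the", but only because both lines pass through the same loop in order.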
Any ideas to set me on the right track?