MapReduce: How to keep track of states across multiple lines in the mapper (say for counting trigrams)?

Question

I'm trying to write a MapReduce program for counting trigrams using the mrjob framework in Python. So far, this is what I have:

from mrjob.job import MRJob

class MRTrigram(MRJob):

    def mapper(self, _, line):
        w = line.split()
        for idx,word in enumerate(w):
            if idx < len(w) - 2:
                # Generate a trigram using the current word and next 2 words
                trigram = w[idx] + " " + w[idx + 1] + " " + w[idx + 2]
                yield trigram, 1

    def reducer(self, key, values):
        yield sum(values), key

# ignore this part - it's just standard boilerplate for mrjob!
if __name__ == '__main__':
    MRTrigram.run()

As you can see, I have not handled the case where a trigram is split across lines. For example, if line 3 ends with "it was" and line 4 begins with "the best of times", my code would not capture the trigram "it was the".
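A quick standalone check (plain Python, no mrjob) makes the gap concrete. The helper `trigrams` and the two sample lines are only for illustration; they are not part of the job above.

```python
def trigrams(words):
    """Return every run of three consecutive words as a space-joined string."""
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

# Two consecutive lines, as in the example above.
lines = ["it was", "the best of times"]

# What the mapper above effectively does: trigrams within each line only.
per_line = [t for line in lines for t in trigrams(line.split())]

# What is actually wanted: trigrams over the text as one word stream.
whole_text = trigrams(" ".join(lines).split())

# Trigrams that straddle the line break and are therefore never emitted.
missed = [t for t in whole_text if t not in per_line]
print(missed)  # ['it was the', 'was the best']
```

So two trigrams are lost at every line break where both neighbouring lines have at least two words.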

How do I preserve state across multiple map calls so that, however the underlying runtime assigns input to the mappers, trigrams are counted only across consecutive lines? I thought of storing the last 2 words of each line in a persistent data structure inside the MRTrigram class, but then I realized I could not guarantee that I was comparing words across lines i and i+1 (and not lines i and j, where j could be any line in the document!).

Any ideas to set me on the right track?

Really, no-one, on a Monday, on a question with an MWE and a clear query? Maybe it's not as simple a confusion as I thought! :) — TCSGrad, Mar 03 '14 at 18:47
I am interested in this too. Every implementation online seems to ignore this issue and the answer below isn't helpful. — Daniel Parry, Feb 08 '16 at 01:06

score 0 · Answer 1 · answered Apr 17 '14 at 17:09

You might get a hint as to how this could be done by writing a custom protocol, but I believe mrjob splits its input stream on the newline character before any customized behavior (i.e., forming the key and value) can be applied, so it might not be possible with mrjob.

If you are using Hadoop (i.e., native Java), then you can write a custom input format that takes multiline text and parses a key-value pair out of it.
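As a rough illustration of what such key-value records would make possible, here is a plain-Python simulation (not mrjob or Hadoop code). It assumes each record arrives as a `(line_no, text)` pair, which the default line-based input does not provide, so the line numbers would have to come from a custom input format or a preprocessing step. The names `map_line`, `reduce_boundary` and `count_trigrams` are hypothetical. Each line emits its own trigrams plus its first and last two words, keyed by the line break they touch, so the reduce step only ever joins line i with line i+1.

```python
from collections import Counter, defaultdict

def map_line(line_no, text):
    """Emit in-line trigrams plus the edge words needed to stitch neighbours."""
    words = text.split()
    for i in range(len(words) - 2):
        yield ("trigram", " ".join(words[i:i + 3])), 1
    if words:
        # Boundary n is the break between line n and line n + 1.
        yield ("boundary", line_no), ("tail", words[-2:])
        yield ("boundary", line_no - 1), ("head", words[:2])

def reduce_boundary(parts):
    """Join the tail of one line with the head of the next into trigrams."""
    sides = dict(parts)
    tail, head = sides.get("tail"), sides.get("head")
    if tail is None or head is None:
        return  # first or last line: nothing on the other side
    joined = tail + head
    for i in range(len(joined) - 2):
        # Keep only windows that contain words from both lines.
        if i < len(tail) < i + 3:
            yield " ".join(joined[i:i + 3])

def count_trigrams(lines):
    """Simulate the shuffle locally: group by key, then reduce."""
    counts = Counter()
    boundaries = defaultdict(list)
    for line_no, text in enumerate(lines):
        for key, value in map_line(line_no, text):
            if key[0] == "trigram":
                counts[key[1]] += value
            else:
                boundaries[key[1]].append(value)
    for parts in boundaries.values():
        for trigram in reduce_boundary(parts):
            counts[trigram] += 1
    return counts

counts = count_trigrams(["it was", "the best of times"])
print(counts["it was the"], counts["was the best"])  # 1 1
```

This sketch only stitches adjacent lines: a trigram that spans three lines (for example, across a one-word line) is still missed, and an empty line breaks the chain. In a real job the stitched trigrams would need a second step to be summed together with the in-line counts.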
