How to implement the MapReduce pairs pattern in Python

Question

I am trying to implement the MapReduce pairs pattern in Python. I need to check whether a word is in a text file, find the word next to it, and yield a pair of both words. I keep running into either:

neighbors = words[words.index(w) + 1]
ValueError: substring not found

or

 ValueError: ("the") is not in list

File cwork_trials.py:

from mrjob.job import MRJob

class MRCountest(MRJob):
    # Word count
    def mapper(self, _, document):
        # Assume document is a list of words.
        #words = []
        words = document.strip()

        w = "the"
        neighbors = words.index(w)
        for word in words:
            #searchword = "the"
            #wor.append(str(word))
            #neighbors = words[words.index(w) + 1]
            yield(w,1)

    def reducer(self, w, values):
        yield(w,sum(values))

if __name__ == '__main__':
    MRCountest.run()

Edit: I am trying to use the pairs pattern to search a document for every instance of a specific word and find the word next to it each time, then yield a pair for each instance. For example, find the instances of "the" and the word next to each one, i.e. [the], [book], [the], [cat] etc.

from mrjob.job import MRJob

class MRCountest(MRJob):
    # Word count
    def mapper(self, _, document):
        # Assume document is a list of words.
        #words = []
        words = document.split(" ")

        want = "the"
        for w, want in enumerate(words, 1):
            if (w+1) < len(words):
                neighbors = words[w + 1]
                pair = (want, neighbors)
                for u in neighbors:
                    if want is "the":
                        #pair = (want, neighbors)
                        yield(pair),1
        #neighbors = words.index(w)
        #for word in words:

            #searchword = "the"
            #wor.append(str(word))
            #neighbors = words[words.index(w) + 1]
            #yield(w,1)

    #def reducer(self, w, values):
        #yield(w,sum(values))

if __name__ == '__main__':
    MRCountest.run()

As it stands, the job yields every word pair, with the same pairing repeated multiple times.

There is no requested input. The document should be searched for a specific word, such as "the" in the code. The expected result is one pair for each instance of the search term: "the" along with the word that immediately follows it, i.e. the bird, the book, the house and so on. — Jackob, Dec 13 '17 at 15:36

score 1 · Answer 1 · answered Dec 12 '17 at 14:25

When you use words.index("the"), you only get the first instance of "the" in your list or string and, as you have found, you get an error if "the" isn't present.
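To make that behaviour concrete, here is a small sketch using a made-up sentence (the sample text and variable names are only for illustration). It shows index() returning the first match only, one way to collect every position instead, and the ValueError raised for a missing word.

```python
# Hypothetical sample text, already split into a list of words.
words = "the cat saw the dog".split()

# list.index returns only the position of the first match.
first = words.index("the")

# enumerate lets you collect every matching position instead.
positions = [i for i, word in enumerate(words) if word == "the"]

# A missing value raises ValueError rather than returning -1.
try:
    words.index("bird")
    missing_raises = False
except ValueError:
    missing_raises = True
```

The same applies to str.index on a plain string, which is where the "substring not found" message comes from.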

Also, you mention that you are trying to produce pairs, but you only yield a single word.

I think what you are trying to do is something more like this:

def get_word_pairs(words):
    for i, word in enumerate(words):
        # Pair with the following word, if there is one.
        if (i+1) < len(words):
            yield (word, words[i + 1]), 1
        # Pair with the preceding word, if there is one.
        if i > 0:
            yield (word, words[i - 1]), 1

assuming you are interested in neighbours in both directions. (If not, you only need the first yield.)

Lastly, since you use document.strip(), I suspect that document is in fact a string and not a list. If that's the case, you can use words = document.split(" ") to get the word list, assuming you don't have any punctuation.
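If you only want pairs that start with one search word, a rough sketch could look like the following. It is plain Python rather than an mrjob job so it can be run on its own. The names pairs_for_target and count_pairs are made up for this example, and lowercasing the line is an assumption on my part so that "The" also matches. It also assumes each input is a single line of text, so a pair split across two lines is not detected.

```python
from collections import Counter


def pairs_for_target(line, target="the"):
    """Yield ((target, next_word), 1) for each occurrence of target in line."""
    # Assumption: lowercase the text and split on any whitespace.
    words = line.lower().split()
    # zip pairs each word with its successor, so the last word has no pair.
    for word, following in zip(words, words[1:]):
        if word == target:
            yield (word, following), 1


def count_pairs(lines, target="the"):
    """Stand-in for the reducer: sum the 1s emitted for each pair."""
    counts = Counter()
    for line in lines:
        for pair, one in pairs_for_target(line, target):
            counts[pair] += one
    return counts


counts = count_pairs(["the cat sat on the mat", "The cat ran"])
```

In an mrjob job, the mapper would presumably loop over pairs_for_target(line) and yield each item, with the reducer yielding the pair together with sum(values), as in your original reducer.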

Hey. Tried this method, but now all my yielded results are numerical. What I am trying to get is a yield of all pairs containing my searchword "the". — Jackob, Dec 14 '17 at 09:51
To clarify, I am trying to implement: for all instances of "the" in the document find the word neighbouring it and yield each pair and a count. This should be done using a nested for loop. — Jackob, Dec 14 '17 at 09:56
