
I'm trying to find the longest word(s) in a text file for each starting letter, a through z.

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word[0].lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer(self, _, word_count_pairs):
        longest_word = ''
        for word in word_count_pairs:
            if len(word) > len (longest_word):
                longest_word = word
        yield max(longest_word)

if __name__ == '__main__':
    MRWordFreqCount.run()

The output should look something like this, but I'm stuck:

"r" ["recommendations", "representations"]

"s" ["superciliousness"]
OneCricketeer
Phat Phat

1 Answer


Your mapper is currently outputting only the first character of each word.

Your combiner then counts how many words start with each letter. That discards the words themselves, so the reducer has nothing left to compare lengths on.


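Part of the problem: max() returns only one value, so it can't report several longest words that are all the same length. It also isn't limited to numbers; called on a string, as in max(longest_word), it returns the largest character of that string rather than the longest word.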
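To illustrate how max() behaves here, a small sketch in plain Python. The word list is made up for the example and is not taken from the asker's file:

```python
# Example data (an assumption for illustration only).
words = ["recommendations", "representations", "river"]

# max() on a single string compares its characters, so it returns one letter.
largest_char = max("river")

# With key=len, max() still returns only one word: the first of maximal length.
one_longest = max(words, key=len)

# To keep ties, compute the maximum length first, then filter on it.
max_len = max(len(w) for w in words)
all_longest = [w for w in words if len(w) == max_len]

print(largest_char)
print(one_longest)
print(all_longest)
```

The two-pass pattern at the end (find the maximum length, then filter) is what the reducer needs if tied words should all be reported.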

If you don't care about the leading letters, MapReduce isn't really beneficial, since you would need to force all words into a single reducer, as in the example below. This is also not a recommended approach for very large files, because the reducer copies every word into memory.

def mapper(self, _, line):
    for word in WORD_RE.findall(line):
        # A single shared key sends every word to the same reducer
        yield None, word

def reducer(self, _, words):
    lst = list(words)  # copy the iterator into an in-memory list so it can be scanned twice
    max_len = max(len(w) for w in lst)
    max_words = [w for w in lst if len(w) == max_len]
    yield None, max_words

The alternative strategy is to find the longest words per letter first. Then, if you also want the overall maximum, pass that output to a second MapReduce job.
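A minimal sketch of the per-letter step, written as plain functions so it runs without a cluster. The names map_line, reduce_letter, and run_locally are hypothetical helpers, not part of mrjob; the bodies of the first two are meant to be dropped into the job's mapper and reducer methods. It assumes words should be lowercased and that words not starting with a letter can be skipped:

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")

def map_line(line):
    """Mapper body: emit (first letter, word) so words group by leading letter."""
    for word in WORD_RE.findall(line):
        if not word[0].isalpha():
            continue  # assumption: ignore tokens starting with a digit, "_" or "'"
        yield word[0].lower(), word.lower()

def reduce_letter(letter, words):
    """Reducer body: keep every distinct word of maximal length for one letter."""
    unique = set(words)  # one letter's words are held in memory here
    max_len = max(len(w) for w in unique)
    yield letter, sorted(w for w in unique if len(w) == max_len)

def run_locally(lines):
    """Hypothetical stand-in for the shuffle phase: group mapper output by key."""
    grouped = defaultdict(list)
    for line in lines:
        for letter, word in map_line(line):
            grouped[letter].append(word)
    results = {}
    for letter in sorted(grouped):
        for key, longest in reduce_letter(letter, grouped[letter]):
            results[key] = longest
    return results

if __name__ == "__main__":
    sample = ["Recommendations and representations", "superciliousness surely"]
    for letter, longest in run_locally(sample).items():
        print(letter, longest)
```

Because the key is the leading letter, each reducer call only sees the words for one letter, which is what produces output in the shape the question asks for. The original counting combiner would have to be removed or rewritten to pass words through, since summing throws the words away.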

OneCricketeer