
I am trying to write a job that takes in a text file, counts the number of syllables in each word, and ultimately returns the top 10 words with the most syllables. I'm able to get all of the word/syllable pairs sorted in descending order; however, I am struggling to figure out how to return only the top 10 words. Here's my code so far:

from mrjob.job import MRJob
from mrjob.step import MRStep
import re
WORD_RE = re.compile(r"[\w']+")

class MRMostUsedWordSyllables(MRJob):
    
    def steps(self):
        return [
            MRStep(mapper=self.word_splitter_mapper,
                   reducer=self.sorting_word_syllables),
            MRStep(reducer=self.reducer_word_sorted),
            MRStep(reducer=self.get_top_10_reducer)
        ]
    
    def word_splitter_mapper(self, _, line):
        #for word in line.split():
        for word in WORD_RE.findall(line):
            yield(word.lower(), None)
        
    def sorting_word_syllables(self, word, count):
        count = 0
        vowels = 'aeiouy'
        word = word.lower().strip()
        if word in vowels:
            count +=1
        for index in range(1,len(word)):
            if word[index] in vowels and word[index-1] not in vowels:
                count +=1
        if word.endswith('e'):
            count -= 1
        if word.endswith('le'):
            count+=1
        if count == 0:
            count +=1
        yield None, (int(count), word)
    
    
    
    def reducer_word_sorted(self, _, syllables_counts):
        for count, word in sorted(syllables_counts, reverse=True):
            yield (int(count), word)
            
    def get_top_10_reducer(self, count, word):
        self.aList = []
        for value in list(range(count)):
            self.aList.append(value)
        self.bList = []
        for i in range(10):
            self.bList.append(max(self.aList))
            self.aList.remove(max(self.aList))
        for i in range(10):
            yield self.bList[i]


if __name__ == '__main__':
   import time
   start = time.time()
   MRMostUsedWordSyllables.run()
   end = time.time()
   print(end - start)

I know my issue is with the "get_top_10_reducer" function. I keep getting ValueError: max() arg is an empty sequence.

  • Based on the error, `list(range(count))` is empty. What debugging have you done? You shouldn't need Hadoop to test this code, by the way – OneCricketeer Mar 07 '22 at 20:33
  • Hi @OneCricketeer, appreciate the response! I have tried a handful of different ways, but I feel this is the closest I have gotten. Yeah, I noticed that, which is weird because when I run this without the "top_10_reducer" it returns all the key/value pairs, so it's weird it keeps coming back empty. I feel like I am missing something small but fundamental here – dimension_dweller Mar 07 '22 at 21:56
  • What are you expecting `count` to be? And why not do `self.aList = [x for x in range(count)]`? And why are you trying to remove/append between A and B lists? – OneCricketeer Mar 07 '22 at 22:00

1 Answer


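According to the error, the list you pass to max() was empty at some point. That happens if one of your reducers has returned 0 (or less) for the count, and also whenever the count is smaller than 10, because the loop removes ten items from a list that only holds `count` of them. Do you have an empty line in your input, for example? You should filter this data out as early as possible.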
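To see how the list can run dry, here is a minimal sketch that mimics the loop from `get_top_10_reducer` outside of mrjob. The function name `take_top_10` is made up for illustration, and it assumes `count` is the integer key that the reducer receives.

```python
def take_top_10(count):
    """Mimic the loop in get_top_10_reducer for one integer key (illustrative only)."""
    candidates = list(range(count))  # holds exactly `count` items, none if count <= 0
    top = []
    for _ in range(10):
        # max() raises ValueError as soon as candidates has been emptied
        largest = max(candidates)
        top.append(largest)
        candidates.remove(largest)
    return top


for count in (0, 3, 12):
    try:
        print(count, take_top_10(count))
    except ValueError as exc:
        print(count, "ValueError:", exc)
```

Only the call with a count of at least 10 completes; the others fail because the list holds fewer than ten items to remove.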


Overall, I think you need to remove reducer_word_sorted. There is no guarantee this returns sorted data. Instead, I think it regroups all data based on the numeric count key, then emits in a non-deterministic order to the next step.

That being said, your top 10 reducer never uses the `word` parameter, which is actually an iterable itself: all the words grouped under each count key emitted by the previous reducer.
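A rough way to picture this regrouping, in plain Python rather than mrjob itself: the hypothetical `group_by_key` helper below imitates what happens between two steps, assuming the previous reducer emitted `(count, word)` pairs. It is only a sketch of the idea, not the framework's implementation, and the real framework makes no promise that keys reach the next step in sorted order.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect values under their key, the way a reducer sees them (illustrative only)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Sample (count, word) pairs, made up for this example
emitted = [(3, "banana"), (1, "cat"), (3, "syllable"), (2, "apple")]

# The next reducer is called once per key, with every word that shares that count
for count, words in group_by_key(emitted).items():
    print(count, words)
```

Each call to the next reducer therefore sees one count and a group of words, never the full sorted stream.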

With `reducer_word_sorted` removed, `sorting_word_syllables` emits None as its key. This is fine, because all of the split words then arrive at the final reducer in one giant list, so define a regular function

def get_syllable_count_pair(word):
  # syllables() is assumed to hold the counting logic from sorting_word_syllables
  return (syllables(word), word)
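The `syllables` function is not defined anywhere in the question, so here is one possible sketch of it, adapted from the counting logic in `sorting_word_syllables`. Two things are assumptions on my part: the first check looks at the first letter only (the original `word in vowels` test is only true for substrings of 'aeiouy'), and the result is clamped to at least 1 so it can never be zero or negative.

```python
VOWELS = "aeiouy"


def syllables(word):
    """Rough syllable estimate, adapted from the question's heuristic."""
    word = word.lower().strip()
    count = 0
    # Assumption: a leading vowel starts a syllable
    if word and word[0] in VOWELS:
        count += 1
    # Each vowel that follows a non-vowel starts a new syllable
    for index in range(1, len(word)):
        if word[index] in VOWELS and word[index - 1] not in VOWELS:
            count += 1
    if word.endswith("e"):
        count -= 1  # a trailing 'e' is usually silent
    if word.endswith("le"):
        count += 1  # ...but a trailing 'le' is voiced
    return max(count, 1)  # every word gets at least one syllable


def get_syllable_count_pair(word):
    return (syllables(word), word)


print(get_syllable_count_pair("banana"))
```

Because these are plain functions, they can be tested on their own before being wired into a reducer.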

Use that within the reducer

def get_top_10_reducer(self, count, word):
  assert count is None  # guard: everything should arrive under the single None key
  # word is an iterable of every word grouped under that key
  with_counts = [get_syllable_count_pair(w) for w in word]
  # Sort the words by the syllable count
  sorted_counts = sorted(with_counts, reverse=True, key=lambda x: x[0])
  # Slice off the first ten
  for t in sorted_counts[:10]:
    yield t
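As an alternative to sorting everything and slicing, the standard library's `heapq.nlargest` picks the top entries directly. This is a sketch under the same assumption as above, namely that the reducer has a collection of `(count, word)` pairs; the name `top_n_by_syllables` is hypothetical.

```python
import heapq


def top_n_by_syllables(pairs, n=10):
    """Return the n pairs with the highest syllable count, largest first."""
    return heapq.nlargest(n, pairs, key=lambda pair: pair[0])


# Sample (count, word) pairs, made up for this example
pairs = [(1, "cat"), (3, "banana"), (2, "apple"), (5, "university"), (4, "macaroni")]

for count, word in top_n_by_syllables(pairs, n=3):
    print(count, word)
```

Inside the reducer, the loop would yield each pair instead of printing it. If fewer than `n` pairs exist, `nlargest` simply returns all of them, so there is no empty-sequence error to guard against.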
OneCricketeer