def mapper(self, _, line):
    stop_words = set(["to", "a", "an", "the", "for", "in", "on", "of", "at", "over", "with", "after", "and", "from", "new", "us", "by", "as", "man", "up", "says", "in", "out", "is", "be", "are", "not", "pm", "am", "off", "more", "less", "no", "how"])
    (date,words) = line.strip().split(",")

    word_list = words.split()
    clean_words = [word for word in word_list if word not in stop_words]
    clean_words.sort()


    yield (date[0:4],clean_words)

This is in an MRJob mapper. The current output looks like this:

"2003" ["word 1","word 2", "word 3", "word 4"]
"2004" ["word 1","word 2", "word 3", "word 4"]

What I would like it to be is:

"2003" "Word 1"
"2003" "Word 2"
"2004" "Word 3"
"2004" "Word 4"

Once the output looks like this, I can send it to the reducer to count the words for each year and pick the top 3.

Barmar
CKZ

1 Answer


Use a loop to yield each word separately.

def mapper(self, _, line):
    # Common words to drop before counting.
    stop_words = {"to", "a", "an", "the", "for", "in", "on", "of", "at", "over", "with", "after", "and", "from", "new", "us", "by", "as", "man", "up", "says", "out", "is", "be", "are", "not", "pm", "am", "off", "more", "less", "no", "how"}
    # Each input line is "date,words"; split on the first comma only,
    # so a comma inside the words does not break the unpacking.
    date, words = line.strip().split(",", 1)

    word_list = words.split()
    clean_words = [word for word in word_list if word not in stop_words]
    clean_words.sort()

    # Emit one (year, word) pair per word instead of one list per line.
    for word in clean_words:
        yield (date[0:4], word)
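
To check the per-word output without running a full job, the same logic can be tried in plain Python. This is only a rough sketch: map_line, top_words_per_year, the shortened STOP_WORDS set and the sample lines are all made up for illustration, and the grouping step merely stands in for what the framework's shuffle and a reducer would do.

```python
from collections import Counter, defaultdict

# Shortened, illustrative stop-word set (assumption: the real job uses the full list).
STOP_WORDS = {"to", "a", "an", "the", "for", "in", "on", "of"}


def map_line(line):
    """Yield one (year, word) pair per non-stop word, mirroring the mapper."""
    date, words = line.strip().split(",", 1)
    for word in words.split():
        if word not in STOP_WORDS:
            yield date[0:4], word


def top_words_per_year(lines, n=3):
    """Group the mapped words by year and keep the n most common per year."""
    grouped = defaultdict(Counter)
    for line in lines:
        for year, word in map_line(line):
            grouped[year][word] += 1
    return {year: counts.most_common(n) for year, counts in grouped.items()}


# Hypothetical input lines in the "date,words" layout the mapper expects.
sample = [
    "20030219,rain floods town",
    "20030220,rain delays match",
    "20040101,fire threatens town",
    "20040102,fire crews on alert",
]

result = top_words_per_year(sample)
for year in sorted(result):
    print(year, result[year])
```

In an MRJob job, the counting part would most likely go into a reducer(self, year, words) method that builds a Counter from the words it receives and yields the most common three.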
Barmar
Thank you so much @Barmar - I knew I had to iterate somewhere!!! Too tired to work it out. Cheers – CKZ Nov 16 '21 at 22:18