3

Is there any way to sort the output of a reducer function using mrjob?

I think the input to a reducer is sorted by key, and I tried to exploit this to sort the output with a second step, as below. I know the values are numeric; I want to count the occurrences of each key and then sort the keys by that count:

def mapper_1(self, key, line):
    key = ...  # extract the key from the line (elided)
    yield (key, 1)

def reducer_1(self, key, values):
    yield key, sum(values)

def mapper_2(self, key, count):
    yield ('%020d' % int(count), key)

def reducer_2(self, count, keys):
    for key in keys:
        yield key, int(count)

but its output is not correctly sorted! I suspected that this weird behavior was due to manipulating ints as strings, and I tried to format them as this link says, but it didn't work!
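A quick plain-Python check of the padding behavior (the counts here are just sample values, not my real data):

```python
# Unpadded numeric strings sort lexicographically, while strings
# zero-padded to a common width sort in true numeric order.
counts = [2, 13, 744, 34]

print(sorted(str(c) for c in counts))
# → ['13', '2', '34', '744']  (lexicographic: '1' < '2' < '3' < '7')

print(sorted('%020d' % c for c in counts))
# padded strings come back in numeric order: 2, 13, 34, 744
```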

IMPORTANT NOTE: When I use the debugger to inspect the output of reducer_2, the order is correct, but what is actually printed as output is something else!!!

IMPORTANT NOTE 2: On another computer, the same program on the same data returns output sorted as expected!

Dandelion

1 Answer

8

You can sort the values as integers in the second reducer and then convert them into the zero-padded representation:

import re

from mrjob.job import MRJob
from mrjob.step import MRStep

WORD_RE = re.compile(r"[\w']+")


class MRWordFrequencyCount(MRJob):

    def steps(self):
        return [
            MRStep(
                mapper=self.mapper_extract_words, combiner=self.combine_word_counts,
                reducer=self.reducer_sum_word_counts
            ),
            MRStep(
                reducer=self.reduce_sort_counts
            )
        ]

    def mapper_extract_words(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combine_word_counts(self, word, counts):
        yield word, sum(counts)

    def reducer_sum_word_counts(self, key, values):
        yield None, (sum(values), key)

    def reduce_sort_counts(self, _, word_counts):
        for count, key in sorted(word_counts, reverse=True):
            yield ('%020d' % int(count), key)

Well, this is sorting the output in memory, which might be a problem depending on the size of the input. But you want it sorted, so it has to be sorted somehow.
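To see the shape of the trick outside mrjob: because reducer_sum_word_counts routes everything under the single None key, one reducer call receives every (count, word) pair and can sort them all at once (the pairs below are made-up sample data):

```python
# Sketch of reduce_sort_counts in isolation: tuples sort element-wise,
# so sorting (count, word) pairs in reverse puts the highest count first.
word_counts = [(2, 'dandelion'), (744, 'the'), (34, 'sort'), (13, 'job')]

for count, word in sorted(word_counts, reverse=True):
    print('%020d' % count, word)
# highest count first: the, sort, job, dandelion
```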

Tomasz Swider
  • Thank you, that worked! But one point remains: why is my code producing different outputs on different machines? It sorted the output on all computers except mine!!! – Dandelion Dec 19 '18 at 04:05
  • What are the operating systems, versions of mrjob, Python, etc.? – Tomasz Swider Dec 19 '18 at 05:18
  • I am using macOS, Python 3.5, and mrjob 0.6.6. It was sorted on multiple Windows machines using Anaconda, whose version I don't know! – Dandelion Dec 19 '18 at 05:26
  • And the other machines that it was sorting on? :) – Tomasz Swider Dec 19 '18 at 05:27
  • I have Ubuntu 16, Python 3.6.2, and mrjob 0.6.6, and it was not sorting for me. There is a comment on the mrjob question you linked suggesting that it should be sorted after the combine step, but only for small inputs, whatever small means. – Tomasz Swider Dec 19 '18 at 05:39
  • I think I saw somewhere in the documentation that keys are passed to the reducer in sorted order, but I cannot find it!!!! This post seems relevant: https://stackoverflow.com/questions/42078886/why-is-mrjob-sorting-my-keys – Dandelion Dec 19 '18 at 05:45
  • I do not think that mrjob or Hadoop can sort keys from map before the reducer; sorting is not free, and for large data sets in a distributed system it is very difficult and very often not required. So I think it might happen 'by some accident' for small inputs but would be a bad thing to do for large ones. – Tomasz Swider Dec 19 '18 at 06:03
  • I totally agree with you! – Dandelion Dec 19 '18 at 06:11