I want to write a MapReduce program whose reducer receives key-value pairs sorted by value. I'm using mrjob, whose SORT_VALUES attribute seemed ideal for the task. However, after setting it to True, the reducer input is still not sorted. For instance, I get the following (A should come before X):

"ES"    ["X", 3]
"ES"    ["A", "Spain"]

I'm using Python 2.7.5, mrjob==0.6.1 and Hadoop. Running the program locally gives me:

"ES"    ["A", "Spain"]
"ES"    ["X", 1]
"ES"    ["X", 2]

That is correct, but running on Hadoop gives:

"ES"    ["X", 3]
"ES"    ["A", "Spain"]

My code is:


from mrjob.job import MRJob
from mrjob.step import MRStep 

class MRJoin(MRJob):

    SORT_VALUES = True

    def mapper(self, _, line):
        splits = line.rstrip("\n").split(",")
        if len(splits) == 2:  # countries file: two fields per line
            symbol = 'A'  # 'A' sorts before 'X', so countries come before clients
            country2digit = splits[1]
            yield country2digit, (symbol, splits[0])
        else:  # clients file
            symbol = 'X'
            country2digit = splits[2]
            if splits[1] == 'bueno':  # count only clients flagged 'bueno'
                yield country2digit, (symbol, 1)

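    def combiner(self, key, values):
        bueno = 0
        for value in values:
            if value[0] == 'A':
                # pass country records through unchanged
                yield key, ('A', value[1])
            else:
                # add the partial count; Hadoop may run the combiner more than once
                bueno += value[1]

        if bueno > 0:
            yield key, ('X', bueno)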

    def reducerSimple(self, key, values):
        for value in values:
            yield key,value


    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   combiner=self.combiner,
                   reducer=self.reducerSimple)
        ]


if __name__ == '__main__':
    MRJoin.run()

I run the above code like this:

python mrjob-p2.py /media/notebooks/clients.csv /media/notebooks/countries.csv -r hadoop

Which gives:

"ES"    ["X", 3]
"ES"    ["A", "Spain"]
...
"GN"    ["A", "Guinea"]
"GN"    ["X", 1]
...

The values for the ES key (and a few others) are not sorted, although for other keys they are.

I expected (A should come before X if the values were sorted):

"ES"    ["A", "Spain"]
"ES"    ["X", 3]
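
To make that expectation concrete, here is a minimal sketch of the ordering I have in mind. It assumes that SORT_VALUES orders values by their encoded form (my reading of the mrjob docs) and that the default protocol writes each pair as tab-separated JSON. The sample pairs and the `encode` helper are hypothetical, written only for this illustration; they are not part of mrjob.

```python
import json

# Hypothetical sample of what arrives for the "ES" key, in the wrong order.
pairs = [("ES", ("X", 3)), ("ES", ("A", "Spain"))]


def encode(key, value):
    # Assumption: key and value are each JSON-encoded and joined by a tab,
    # mirroring the lines shown in the output above.
    return json.dumps(key) + "\t" + json.dumps(value)


# Sorting the encoded lines as plain text puts 'A' records before 'X' records.
lines = sorted(encode(k, v) for k, v in pairs)
for line in lines:
    print(line)
```

With plain text comparison the country record lands first, which is the order I get with `-r local` but not with `-r hadoop`.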

If I run locally:

python mrjob-p2.py /media/notebooks/clients.csv /media/notebooks/countries.csv -r local

Then I get:

"ES"    ["A", "Spain"]
"ES"    ["X", 1]
"ES"    ["X", 2]
...
"GN"    ["A", "Guinea"]
"GN"    ["X", 1]
...

Which is correct.

Does anybody have an idea how to get the values sorted on Hadoop as well?

Thanks :)

Ben Watson