This is an MRJob implementation of a simple MapReduce sort. In beta.py:

from mrjob.job import MRJob

class Beta(MRJob):
    def mapper(self, _, line):
        """
        """
        l = line.split(' ')
        yield l[1], l[0]

    def reducer(self, key, val):
        yield key, [v for v in val][0]


if __name__ == '__main__':
    Beta.run()

I run it on the following input:

1 1
2 4
3 8
4 2
4 7
5 5
6 10
7 11

One can run this using:

cat <filename> | python beta.py

The problem is that the output is sorted as if the keys were strings (which they probably are here), so "10" and "11" come before "2". The output is:

"1"     "1"
"10"    "6"
"11"    "7"
"2"     "4"
"4"     "2"
"5"     "5"
"7"     "4"
"8"     "3"
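
The ordering above can be reproduced outside of MRJob. This is only a minimal sketch in plain Python, assuming the keys reach the sort step as strings; the variable names are made up for illustration and nothing here is MRJob API.

```python
# Keys taken from the second column of the sample input.
keys = ["1", "4", "8", "2", "7", "5", "10", "11"]

# Strings are compared character by character, so "10" sorts before "2".
as_strings = sorted(keys)

# Comparing the integer values instead gives the numeric order.
as_numbers = sorted(keys, key=int)

print(as_strings)
print(as_numbers)
```

The first list matches the order of the job output, and the second is the order I am after.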

The output that I want is:

"1"     "1"
"2"     "4"
"4"     "2"
"5"     "5"
"7"     "4"
"8"     "3"
"10"    "6"
"11"    "7"

I am not sure whether this can be solved by changing the protocols in MRJob, since protocols are set per job and not per step.

EDIT (Solution): I have found an answer for this one. The idea is to prepend zeros to every number so that each number has the same number of digits as the largest number; the string order then matches the numeric order. At least that's what I remember from my classes. I cannot post it as an answer right now because the site won't permit it, but this is the only solution I've got. If anyone has something more transparent and easy, please share.
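
A minimal sketch of the padding idea in plain Python, assuming non-negative integer keys. The helper `pad_keys` is hypothetical and only for illustration. In a real job the mapper does not know the largest key in advance, so a fixed width that is large enough would have to be chosen instead.

```python
def pad_keys(numbers):
    """Zero-pad each number to the digit count of the largest one (hypothetical helper)."""
    width = len(str(max(numbers)))
    return [str(n).zfill(width) for n in numbers]


# Keys from the second column of the sample input.
numbers = [1, 4, 8, 2, 7, 5, 10, 11]

padded = pad_keys(numbers)            # every key is now two characters wide
ordered = sorted(padded)              # plain string sort, as the shuffle would do
restored = [int(k) for k in ordered]  # strip the padding again

print(ordered)
print(restored)
```

Because all padded keys have the same length, comparing them as strings gives the same result as comparing them as numbers.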

p0lAris
  • So this one Python script actually creates a MapReduce job for the whole cluster? I usually use Hadoop Streaming when I write MapReduce jobs in Python. – B.Mr.W. Nov 23 '13 at 00:18
  • Well, I am not sure about that, but yes, it can. That's what `MRJob` enables. You can read more about it here: http://pythonhosted.org/mrjob/. – p0lAris Nov 23 '13 at 00:21

1 Answer

A simple solution is to zero-pad the key so that string order matches numeric order (a more robust one might be based on tuning how Hadoop sorts the mapper output):

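from mrjob.job import MRJob

class Beta(MRJob):

    def mapper(self, _, line):
        l = line.split()
        # Zero-pad the key so that string order matches numeric order.
        yield '%010d' % int(l[1]), l[0]

    def reducer(self, key, values):
        # Strip the padding again; keep only the first value per key.
        yield int(key), int(list(values)[0])

if __name__ == '__main__':
    Beta.run()
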
Wajih