4

I am using Yelp's mrjob library to run MapReduce jobs. I know that MapReduce has an internal sort-and-shuffle phase that sorts the mapper output by key. So if I have the following results after the map phase

(1, 24) (4, 25) (3, 26)

I know the sort and shuffle phase will produce the following output

(1, 24) (3, 26) (4, 25)

Which is as expected

But if I have two identical keys with different values, why does the sort and shuffle phase also sort the data by the first element of each value?

For example, if I have the following list of values from the mapper

(2, <25, 26>) (1, <24, 23>) (1, <23, 24>) 

The expected output is

(1, <24, 23>) (1, <23, 24>) (2, <25, 26>)

But the output that I am getting is

(1, <23, 24>) (1, <24, 23>) (2, <25, 26>)

Is this specific to the mrjob library? Is there any way to stop this sorting by value?

CODE

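from mrjob.job import MRJob


class SortMR(MRJob):

    def steps(self):
        # One step: rangemr as the mapper, rangesort as the reducer
        return [
            self.mr(mapper=self.rangemr,
                    reducer=self.rangesort)]

    def rangemr(self, key, line):
        # Emit every whitespace-separated token under the same key (1)
        for a in line.split():
            yield 1, a

    def rangesort(self, numid, line):
        # 'line' iterates over all values collected for key 'numid'
        for a in line:
            yield 1, a


if __name__ == '__main__':
    SortMR.run()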
j0k
Read Q

4 Answers

4

The only way to 'sort' the values is to use a composite key that contains some information from the value itself. Your key's compareTo method can then ensure that the keys are sorted first by the actual key component, then by the value component. Finally, you'll need a partitioner and a grouping comparator to ensure that all the keys with the same 'key' component (the actual key) reach the same reducer, are considered equal there, and have their associated values iterated over in one call to the reduce method.

This is known as a 'secondary sort'; a question similar to this one provides some links to examples.
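The composite-key idea can be illustrated without Hadoop at all. The following is only a minimal pure-Python sketch of the concept, not Hadoop's or mrjob's actual implementation; the function name `secondary_sort` and the use of tuples for the values are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def secondary_sort(pairs):
    """Simulate a secondary sort over (key, value) pairs.

    Hypothetical helper for illustration; values must be comparable.
    """
    # Build a composite key (natural key, value) for every record.
    composite = [((key, value), value) for key, value in pairs]
    # Sort on the composite key: first by natural key, then by value.
    composite.sort(key=itemgetter(0))
    # Group on the natural key only, which is the job a grouping
    # comparator does in Hadoop, so each key yields one list of values.
    grouped = []
    for natural_key, group in groupby(composite, key=lambda item: item[0][0]):
        grouped.append((natural_key, [value for _, value in group]))
    return grouped


mapper_output = [(2, (25, 26)), (1, (24, 23)), (1, (23, 24))]
result = secondary_sort(mapper_output)
print(result)
```

Each "reduce call" here is one entry of `result`, and the values inside it arrive in sorted order because the sort saw them as part of the key.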

Community
Chris White
  • From what I saw while developing with the mrjob library, the values I received were sorted by key and also by the first element of each value list emitted at the end of the map phase. I did not write a composite key or any method to handle such keys. – Read Q Jan 16 '13 at 13:44
  • Today I actually ran the job on EMR and, surprisingly, the output is not sorted. I guess this happens only when you are running the job on a local machine. – Read Q Jan 17 '13 at 05:47
  • It shouldn't matter whether it's running locally or not. Care to post your code? – Chris White Jan 17 '13 at 11:37
3

In local mode, mrjob just uses the operating system's 'sort' on the mapper output.

The mapper writes out in the format:

key<-tab->value\n

Thus the lines end up sorted primarily by key, but secondarily by value.

As noted, this doesn't happen in the real Hadoop version, just in the 'local' simulation.
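The effect is easy to reproduce by sorting whole lines. The sketch below assumes keys and values are JSON-encoded and joined by a tab, and it uses Python's `sorted` as a stand-in for the operating system's `sort` (whose collation may differ by locale); `local_sort` is a hypothetical name, not an mrjob function.

```python
import json


def local_sort(pairs):
    """Sort mapper output the way a line-based sort would (illustrative only)."""
    # Serialize each record as one line: key, tab, value.
    lines = ["%s\t%s" % (json.dumps(key), json.dumps(value))
             for key, value in pairs]
    # Sorting whole lines compares the key text first and, when the keys
    # are identical, falls through to the text of the value.
    return sorted(lines)


mapper_output = [(2, [25, 26]), (1, [24, 23]), (1, [23, 24])]
sorted_lines = local_sort(mapper_output)
for line in sorted_lines:
    print(line)
```

With the records from the question, the two lines for key 1 come out with `[23, 24]` ahead of `[24, 23]`, which matches the output the asker observed.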

Evin
0

The sort and shuffle phase doesn't guarantee the order of the values that the reducer gets for a given key.
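If the reducer needs the values in the order they were emitted, one option is to carry that order in the data and restore it in the reducer rather than relying on the shuffle. This is a sketch of that idea only; `tag_with_position` and `restore_order` are hypothetical helpers, and with several mappers the position would have to be something globally meaningful, such as an input offset.

```python
import random


def tag_with_position(values):
    """Mapper side: pair every value with its emission position."""
    return [(position, value) for position, value in enumerate(values)]


def restore_order(tagged_values):
    """Reducer side: sort on the position and drop it again."""
    return [value for _, value in sorted(tagged_values, key=lambda pair: pair[0])]


emitted = [[24, 23], [23, 24], [25, 26]]
tagged = tag_with_position(emitted)

# Stand-in for the shuffle: the reducer may see the values in any order.
random.Random(0).shuffle(tagged)

restored = restore_order(tagged)
print(restored)
```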

Magham Ravi
0

Sorting in Hadoop is key-based, and hence it doesn't guarantee the order of the values.

Pramod Solanky