I have implemented secondary sorting for my application.

File-1                          File-2                    File-3
------                          ------                    ------

name,pos,r,value           name,pos,r,value            name,pos,r,value

   aa,1,0,123                 aa,2,1,1                    aa,3,1,11
   bb,1,0,234                 aa,2,2,34                   aa,3,2,12
                              aa,2,3,55                   aa,3,3,13
                              bb,2,1,99                   bb,3,1,15
                              bb,2,2,54                   bb,3,2,19
                              bb,2,3,32                   bb,3,3,13

For every record in File-1, there are three matching records in each of File-2 and File-3.

The composite key is name + (pos+r).

The natural key is name.

The sort order is based on the composite key: ascending by (pos+r).

The expected output is the File-1 contents for a given name (aa), followed by all File-2 contents (three rows of aa ordered by pos+r), and then all File-3 contents (three rows of aa ordered by pos+r):

aa,123,1,34,55,11,12,13

bb,234,99,54,32,15,19,13
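
To make the intended ordering concrete, here is a minimal plain-Python sketch (not Hadoop code) that sorts the sample records by the composite key and groups them by the natural key. It assumes that "(pos+r)" means "pos, then r" rather than a numeric sum, since a sum would tie rows such as aa,2,2 and aa,3,1. The names `SAMPLE` and `secondary_sort` are made up for illustration.

```python
from itertools import groupby

# Sample rows from File-1, File-2 and File-3 as (name, pos, r, value),
# deliberately listed out of order.
SAMPLE = [
    ("bb", 3, 2, 19), ("aa", 2, 1, 1), ("aa", 1, 0, 123), ("bb", 2, 3, 32),
    ("aa", 3, 3, 13), ("bb", 1, 0, 234), ("aa", 2, 3, 55), ("bb", 2, 1, 99),
    ("aa", 3, 1, 11), ("bb", 3, 3, 13), ("aa", 2, 2, 34), ("bb", 2, 2, 54),
    ("aa", 3, 2, 12), ("bb", 3, 1, 15),
]


def secondary_sort(records):
    """Sort by the composite key, group by the natural key, emit one line per name."""
    # Composite key: name, then pos, then r (assumed reading of "pos+r").
    ordered = sorted(records, key=lambda rec: (rec[0], rec[1], rec[2]))
    lines = []
    # Grouping looks at the natural key (name) only.
    for name, group in groupby(ordered, key=lambda rec: rec[0]):
        values = [str(rec[3]) for rec in group]
        lines.append(",".join([name] + values))
    return lines


for line in secondary_sort(SAMPLE):
    print(line)
```

The sort step plays the role of the sort comparator and the `groupby` step plays the role of the grouping comparator.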

I have implemented this with secondary sorting, using setGroupingComparatorClass, setSortComparatorClass, and a custom partitioner.
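
The custom partitioner matters because Hadoop's default HashPartitioner hashes the whole key, so with a composite key, rows of the same name could land on different reducers. A rough Python sketch of partitioning on the natural key only follows; the function name `natural_key_partition` and the use of CRC32 as the hash are illustrative assumptions, not what Hadoop does internally.

```python
import zlib


def natural_key_partition(composite_key, num_reducers):
    """Choose a reducer from the natural key (name) only, ignoring pos and r."""
    name = composite_key[0]
    # CRC32 is used here only because it is deterministic across runs.
    return zlib.crc32(name.encode("utf-8")) % num_reducers


# Every composite key for "aa" maps to the same reducer index.
print(natural_key_partition(("aa", 1, 0), 4))
print(natural_key_partition(("aa", 3, 3), 4))
```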

My questions are:

1) How do I add a combiner for this scenario?

  • According to my understanding, grouping and sorting happen in the reduce phase, once all the map outputs (which are partitioned by natural key) have been transferred to the reducer machine.

2) If a combiner is added, how and when will the sorting happen so that the reduce function receives the outputs from all mappers in the proper order?

  • Will the map outputs be sorted twice: once in the combiner, which runs after every map, and again on the reducer side to sort all the combiner outputs?
Raghavi Ravi

1 Answer

I suggest reading http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/

  1. Sorting happens on the mapper.
  2. Merging (sort and merge) happens on the reducer.
  3. A combiner is an extra layer where you try to reduce on the mapper.
  4. A reducer always receives all the values for a given key.
  5. The mapper sends the values for a given key in sorted order.

Please familiarize yourself with the grouping comparator and the sort comparator, and use them appropriately.
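
The points above can be sketched as a small plain-Python simulation (not Hadoop code): each mapper sorts its own output by the composite key, and the reducer only merges the already sorted outputs and groups by the natural key. The names `SPLITS`, `map_side_sort` and `reduce_side` are made up for illustration, and the composite key is assumed to be (name, pos, r).

```python
import heapq
from itertools import groupby

# One unsorted input split per mapper, rows as (name, pos, r, value).
SPLITS = [
    [("bb", 1, 0, 234), ("aa", 1, 0, 123)],
    [("bb", 2, 3, 32), ("aa", 2, 2, 34), ("bb", 2, 1, 99),
     ("aa", 2, 1, 1), ("aa", 2, 3, 55), ("bb", 2, 2, 54)],
    [("aa", 3, 3, 13), ("bb", 3, 2, 19), ("aa", 3, 1, 11),
     ("bb", 3, 3, 13), ("aa", 3, 2, 12), ("bb", 3, 1, 15)],
]


def composite(rec):
    """Composite sort key: name, then pos, then r."""
    return (rec[0], rec[1], rec[2])


def map_side_sort(split):
    """Each mapper sorts its own output by the composite key."""
    return sorted(split, key=composite)


def reduce_side(sorted_map_outputs):
    """Merge the sorted map outputs, then group by the natural key."""
    # Merging sorted inputs with the same key function keeps the global order.
    merged = heapq.merge(*sorted_map_outputs, key=composite)
    return [(name, [rec[3] for rec in group])
            for name, group in groupby(merged, key=lambda rec: rec[0])]


map_outputs = [map_side_sort(split) for split in SPLITS]
for name, values in reduce_side(map_outputs):
    print(name, values)
```

No full re-sort happens on the reduce side in this sketch: because every input to the merge is already ordered by the same comparator, the merged stream is ordered too, and grouping only decides where one reduce call ends and the next begins.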

KrazyGautam
  • If I implement a class to sort the keys and set it as mapred.output.key.comparator.class, will the sorting happen on the map side or the reduce side? If it happens on the map side, will the outputs of all mappers undergo one more round of sorting? – Raghavi Ravi Nov 22 '17 at 12:04
  • Let's start from here. Assume you have 100000000000 mappers and 100 reducers, so a reducer gets map files from 100000000000 mappers. All these map files are sorted. To get all the values for a given key, the reducer would have to open a file pointer to every map file, but opening that many file pointers is not possible. So the reducer starts combining the 100000000000 files into, say, 10000 files. Because the input files are sorted, merging them yields a sorted file. Sorting always happens on the mappers; the reducer does shuffle, merge (sort + merge), and reduce. – KrazyGautam Nov 22 '17 at 12:25
  • Thank you for clarifying the sorting. I now understand that the spill files generated by every map are sorted by key (the composite key, in secondary sorting). But in secondary sort, the reducer receives values (K, List) in the order specified by the SortComparator. If the reduce phase only merges, how does it arrange the values belonging to the same key from many map files in the correct order every time? – Raghavi Ravi Nov 23 '17 at 06:32
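
The merge-in-passes idea from the comments can be sketched in plain Python as follows. This is only an illustration of why a bounded number of open inputs still yields a fully sorted result. It does not reproduce Hadoop's actual merge scheduling, and `merge_in_passes` and `fan_in` are made-up names.

```python
import heapq


def merge_in_passes(runs, fan_in):
    """Merge sorted runs, at most `fan_in` at a time, until one run remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    runs = [list(run) for run in runs]
    while len(runs) > 1:
        # Each pass opens only `fan_in` runs at once and writes one sorted run.
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []


# Five sorted "map files", merged two at a time.
print(merge_in_passes([[1, 4], [2, 5], [3, 6], [0, 7], [8]], fan_in=2))
```

Since every intermediate run is itself sorted, values for the same key from different map files end up adjacent and in comparator order after the final pass, with no separate sort step.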