Hadoop - Properly sort by key and group by reducer

Question

I have some data coming out of the reducer that looks like this:

I would like to sort it by the number in the second column, like this:

When I run my program locally, I use:

sort -k2,2n

But I don't know how to do the same thing on Hadoop. I've tried several options that did not work, such as:

-D mapreduce.partition.keycomparator.options=-k2,2n

Moreover, I would like all the records that have the same key to go to the same reducer. So in this case:

2,3   0

and

6,3   0

should be processed by the same reducer.

Any idea which options I should pass to Hadoop?

Thank you in advance!

score 1 · Accepted Answer · answered Oct 25 '15 at 19:08

In the default job configuration, the first column of the reducer output is the key and the second column is the value. A reducer processes all records that share the same key together. So in your case you need to run an additional MapReduce job whose map step emits the second column as the key and the first column as the value. That job will group the data the way you asked. If the result is small, you can also set a single reducer for the job with -D mapred.reduce.tasks=1 (the property is named mapreduce.job.reduces in newer Hadoop releases), so that everything ends up in one sorted output.
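The map step of that second job could be sketched as a Hadoop Streaming mapper like the one below. This is only a minimal sketch, assuming the first job's output is one record per line with two whitespace-separated columns (as in "2,3   0"); the function name swap_columns is made up for illustration.

```python
#!/usr/bin/env python3
"""Sketch of a streaming mapper that swaps the two columns of each record."""
import sys


def swap_columns(line):
    """Return 'second<TAB>first' for a two-column line, or None if malformed.

    Assumption: columns are separated by whitespace and the second column
    (the number to sort and group by) contains no whitespace itself.
    """
    parts = line.strip().rsplit(None, 1)
    if len(parts) != 2:
        return None
    first, second = parts
    # Hadoop Streaming treats the text before the first tab as the key.
    return f"{second}\t{first}"


def main(stream=sys.stdin, out=sys.stdout):
    for line in stream:
        swapped = swap_columns(line)
        if swapped is not None:
            out.write(swapped + "\n")


if __name__ == "__main__":
    main()
```

With the number now in the key position, records such as "2,3   0" and "6,3   0" share the key 0 and therefore reach the same reducer. Note that keys are compared as text by default, so a numeric order would probably also need the KeyFieldBasedComparator described in the Hadoop Streaming documentation, with the comparator option pointing at the first field (-k1,1n) rather than the second.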
