I think I found what I was looking for. The answer is described here:
http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
The idea is to use a TotalOrderPartitioner. This partitioner first needs a sample of the input keys, which can be generated with an InputSampler, such as a RandomSampler. This sampling is used, I believe, for load balancing, to ensure that all the reducers will get almost the same amount of work (data).
The problem with the default partitioner (the HashPartitioner) is that the reducer to which a (key, value) pair is sent depends only on the key's hash. The sorting then takes place within each reducer's input, so there is no guarantee that a greater key will be handled by a "following" reducer.
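For reference, the default behaviour boils down to roughly the following (shown only to illustrate the point; the Partitioner hook is the same one the TotalOrderPartitioner plugs into):

    import org.apache.hadoop.mapreduce.Partitioner;

    // Roughly what the default HashPartitioner does: the target reducer
    // depends only on the key's hash, so "greater" keys can land in any
    // reducer and only each reducer's own input ends up sorted.
    public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }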
The TotalOrderPartitioner guarantees the latter and the sampling is used for load balancing.
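A rough sketch of what the driver setup could look like, assuming the newer org.apache.hadoop.mapreduce API, SequenceFile input with Text keys and values, identity map/reduce, and made-up paths, reducer count and sampler parameters:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    public class TotalSortJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "total sort");
            job.setJarByClass(TotalSortJob.class);

            // Identity mapper/reducer: the shuffle does the sorting,
            // so the values to be ordered must be the map output keys.
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(4); // hypothetical reducer count

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // The TotalOrderPartitioner reads its split points from a
            // partition file, written by the sampler below.
            job.setPartitionerClass(TotalOrderPartitioner.class);
            Path partitionFile = new Path(args[1] + "_partitions"); // hypothetical path
            TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);

            // RandomSampler(freq, numSamples, maxSplitsSampled): sample the
            // input keys and pick split points so that each reducer gets
            // roughly the same amount of data.
            InputSampler.RandomSampler<Text, Text> sampler =
                new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);
            InputSampler.writePartitionFile(job, sampler);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With this setup, every key in reducer i's output precedes every key in reducer i+1's output, so concatenating the part-r-* files in order yields a globally sorted result.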
After the data have been totally ordered, we can either take the last k entries (e.g. by running tail -k in Unix on the result of hadoop dfs -getmerge), or use an inverted comparator and take the first k, as Thomas Jungblut suggests. Feel free to comment on/edit my answer if it is not correct.
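A minimal sketch of the comparator for the second option, assuming LongWritable keys and the newer API (the class name is made up; it would be registered on the job with job.setSortComparatorClass(DescendingLongWritableComparator.class)):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Sorts LongWritable keys in descending order, so after the shuffle the
    // largest keys come first and the top k are simply the first k records.
    public class DescendingLongWritableComparator extends WritableComparator {

        public DescendingLongWritableComparator() {
            super(LongWritable.class, true); // instantiate keys for comparison
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return -super.compare(a, b); // invert the natural (ascending) order
        }
    }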
EDIT: A better example (in terms of source code) is provided here.
EDIT 2: It seems that this problem is a "classic" one after all, and the solution is also described in the "Total Sort" section of Tom White's book "Hadoop: The Definitive Guide" (page 223 of the 1st edition). You can also follow this link for a free preview.