Hadoop: Sorting by first two keys numerically?

Question

I would like Hadoop (using Streaming and Python) to sort the output of the mapper by the first two keys.

My mapper prints as follows: print '%s\t%s\t%s' % (num1, num2, value)

I want my reducers to receive this data sorted by num1 and then num2, so that these outputs:

are delivered to reducers like so (assuming we have 3 reducers):

1   2   A
1   10  B
-----------
2   1   C
------------
10  3   D

I have tried using the mapred.text.key.partitioner.options option, setting it to -k1n,1 -k2n,2, but this doesn't seem to work.

Any ideas?

I basically want Hadoop to perform this unix sorting: sort -k1n,1 -k2n,2

The version of Hadoop I am using is 0.20.2

Thanks

score 0 · Answer 1 · answered Dec 18 '13 at 13:13

This does not answer your question directly, but here is a workaround: if you concatenate num1 and num2 into a single key in your mapper output, the default sort will do the trick. Just be careful with the printed format: because the default sort compares keys as strings, you need to pad the numbers with leading zeros to a fixed width (e.g. 0002 precedes 0010, but 2 follows 10).
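A minimal sketch of the padding idea, for illustration only. The helper names make_key and format_record, the ':' separator, and the width of 10 digits are all assumptions, not anything Hadoop requires; it also assumes num1 and num2 are non-negative integers that fit in the chosen width. Python's built-in sorted stands in here for the string-based default sort.

```python
WIDTH = 10  # assumed maximum number of digits; adjust to your data


def make_key(num1, num2, width=WIDTH):
    # Zero-pad both numbers to the same fixed width so that plain string
    # ordering of the key matches numeric ordering.
    # Assumes non-negative integers with at most `width` digits.
    return '%s:%s' % (str(num1).zfill(width), str(num2).zfill(width))


def format_record(num1, num2, value):
    # Emit "key<TAB>value", with both numbers folded into a single key.
    return '%s\t%s' % (make_key(num1, num2), value)


if __name__ == '__main__':
    # Hypothetical mapper records, deliberately out of order.
    records = [(10, 3, 'D'), (1, 10, 'B'), (2, 1, 'C'), (1, 2, 'A')]
    # sorted() compares the lines as strings, like the default sort does.
    for line in sorted(format_record(*r) for r in records):
        print(line)
```

The reducer would then have to split the key on the separator and convert both parts back with int() to recover num1 and num2.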
