I am new to Hadoop and have been struggling to write a MapReduce job that finds the top N values for each key (the first column in the data below). Any help or pointers on implementing this in code would be highly appreciated.

Input data
a,1
a,9
b,3
b,5
a,4
a,7
b,1

Output
a 1,4,7,9
b 1,3,5

I believe we should write a Mapper that reads each line, splits it into key and value, and emits the pair so the values are collected per key by the reducer. The sorting would then have to happen in the reducer.

Amit Pandey

2 Answers

If the number of values per key is small enough, the simplest approach is probably best: have the reducer read all values associated with a given key and output the top N.
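As a rough illustration of that simple approach, here is a minimal sketch in plain Python that simulates the map, group, and reduce steps locally. It does not use the Hadoop API; the function names (`map_line`, `reduce_top_n`, `run_job`) and the choice of N = 3 are assumptions made for the example, and the top values are listed in ascending order to match the format in the question.

```python
import heapq
from collections import defaultdict

def map_line(line):
    """Mapper: turn a line such as 'a,9' into the pair ('a', 9)."""
    key, value = line.strip().split(",")
    return key, int(value)

def reduce_top_n(key, values, n):
    """Reducer: keep the n largest values for one key, listed in ascending order."""
    return key, sorted(heapq.nlargest(n, values))

def run_job(lines, n):
    """Local stand-in for the shuffle: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        key, value = map_line(line)
        grouped[key].append(value)
    return dict(reduce_top_n(k, vs, n) for k, vs in sorted(grouped.items()))

lines = ["a,1", "a,9", "b,3", "b,5", "a,4", "a,7", "b,1"]
result = run_job(lines, n=3)
for key, top in result.items():
    print(key, ",".join(map(str, top)))
```

In a real job the grouping is done by the framework; only the bodies of the mapper and reducer would carry over.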

If the number of values per key is large enough that this would be a poor choice, then a composite key is going to work better, and a custom partitioner and comparator will be needed. You'd want to partition based on the natural key (here 'a' or 'b', so that these end up at the same reducer) but with a secondary sort on the value (so that the reducer will see the largest values first).
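To make the composite-key idea concrete, the sketch below simulates the shuffle in plain Python rather than Hadoop: records are partitioned on the natural key only, sorted by natural key and then by value in descending order, and grouped by natural key. The names `partition`, `sort_key`, and `shuffle`, the use of CRC32 as the hash, and the two-reducer setup are all assumptions for illustration.

```python
from itertools import groupby
from zlib import crc32

def partition(composite_key, num_reducers):
    """Partitioner: route on the natural key only, ignoring the value part."""
    natural_key, _ = composite_key
    return crc32(natural_key.encode()) % num_reducers

def sort_key(composite_key):
    """Sort order: natural key ascending, then value descending."""
    natural_key, value = composite_key
    return (natural_key, -value)

def shuffle(records, num_reducers):
    """Return, per reducer, (natural_key, values) groups with the largest values first."""
    buckets = [[] for _ in range(num_reducers)]
    for composite_key in records:
        buckets[partition(composite_key, num_reducers)].append(composite_key)
    output = []
    for bucket in buckets:
        bucket.sort(key=sort_key)
        # Grouping: consecutive records sharing a natural key form one reduce call.
        groups = [(k, [v for _, v in grp])
                  for k, grp in groupby(bucket, key=lambda ck: ck[0])]
        output.append(groups)
    return output

records = [("a", 1), ("a", 9), ("b", 3), ("b", 5), ("a", 4), ("a", 7), ("b", 1)]
per_reducer = shuffle(records, num_reducers=2)
merged = dict(group for groups in per_reducer for group in groups)
print(merged)
```

Because the values arrive already ordered, the reducer never has to hold a whole group in memory to find the largest ones.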

cohoz

The secondary sort trick mentioned by cohoz seems to be what you're looking for.

There's a nice guide here, which even has a similar structure to your problem: in the example, the author wants to walk over each integer timestamp (1,2,3) in sorted order for each class (a,b,c). You'll simply need to modify the reducer in the example so that it walks over only the top n items, emits them, and then stops.
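The "emit the top n, then stop" step might look like the following sketch, again in plain Python rather than the reducer from the linked guide. It assumes the values for a key already arrive sorted with the largest first; `reduce_first_n` and the `counting` helper are hypothetical names, and the helper exists only to show how many items are actually read.

```python
from itertools import islice

def reduce_first_n(key, sorted_values, n):
    """Emit the first n values of an already-sorted stream, then stop reading."""
    return key, list(islice(sorted_values, n))

def counting(values, counter):
    """Wrap an iterable and count how many items are actually pulled from it."""
    for v in values:
        counter[0] += 1
        yield v

pulled = [0]
key, top = reduce_first_n("a", counting([9, 7, 4, 1], pulled), n=2)
print(key, top, "items read:", pulled[0])
```

The remaining values for the key are never touched, which is the benefit of letting the sort happen before the reducer runs.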

Simplefish