
What would be an efficient MapReduce algorithm to find the top-k elements from a dataset, when k is too big to fit k elements in memory? I am talking about a dataset of millions of elements and k being e.g. 3/4 of them. Imagine that each element has a value and we want to find the k elements with the highest values.

E.g. data in the form:

e1: 5
e2: 10
e3: 7
e4: 8

Then the top 2 are e4 and e2 (not caring about their relative order).

I have seen the solution to this problem when k is small enough, but it does not scale. Obviously, using a single reducer would again not be practical (out-of-memory error).

vefthym
  • You want the top-k elements upon what? I think that you forgot about this part. What scoring functions are you using? (per example) – eliasah Jul 11 '14 at 07:30
  • Thanks, I updated my question. Assume that the scores are given and we just want to extract the k elements with the top scores. – vefthym Jul 11 '14 at 07:38
  • This means that you want to apply a [terasort](http://sortbenchmark.org/YahooHadoop.pdf) on your entries using the score to rank them. – eliasah Jul 11 '14 at 07:40
  • Related (possibly a duplicate): http://stackoverflow.com/questions/17410399/finding-k-largest-elements-of-a-very-large-file-while-k-is-very-large/17481369#17481369 – Jim Mischel Jul 11 '14 at 14:11
  • @eliasah very good point! But I cannot find any way to run terasort on my own data. – vefthym Jul 12 '14 at 17:21
  • @JimMischel your solution seems great, even if a little bit too complicated IMHO. Anyway, implementing it in MapReduce is not at all straightforward to me. Thank you for this great feedback though! – vefthym Jul 12 '14 at 17:46
  • "Even if a bit too complicated." That's actually kind of funny. It's a simple disk sort with an early out. Pretty standard stuff. – Jim Mischel Jul 13 '14 at 04:08

3 Answers


I think that I found what I was looking for. The answer was found here: http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/

The idea is to use a TotalOrderPartitioner. This partitioner needs a sample of the keys first, which can be generated with an InputSampler, such as the RandomSampler. The sample is used, I believe, for load balancing, to ensure that all the reducers get almost the same amount of work (data).

The problem with the default partitioner (the HashPartitioner) is that the reducer in which a (key, value) pair ends up is determined by the key's hash, and sorting then takes place only within each reducer's input. This does not guarantee that a greater key will be handled by a "later" reducer. The TotalOrderPartitioner guarantees the latter, and the sampling is used for load balancing.
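For illustration, here is a minimal driver sketch (my own wiring, not code from the linked tutorial) of how such a job could be configured. It assumes the input is already a SequenceFile whose keys are the scores (IntWritable) and whose values are the element ids (Text); the paths, sampler parameters and reducer count are just placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSortDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "total sort by score");
        job.setJarByClass(TotalSortDriver.class);

        // Identity map and reduce: we only care about the global sort order.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(10); // illustrative; pick to match your cluster

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Sample the input keys so that the partition boundaries split the
        // key space into roughly equal-sized ranges (the load balancing part).
        job.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path(args[1] + "_partitions"));
        InputSampler.Sampler<IntWritable, Text> sampler =
                new InputSampler.RandomSampler<>(0.01, 10000, 10);
        InputSampler.writePartitionFile(job, sampler);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With this setup each reducer writes one sorted part file, and every key in part-r-00001 is at least as large as every key in part-r-00000, and so on, which is what makes the "take the last k" extraction below possible.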

After the data have been totally ordered, we can either take the last k (e.g. by using the unix tail -k command on the result of hadoop dfs -getmerge), or use an inverted comparator and take the first k, as Thomas Jungblut suggests. Feel free to comment/edit my answer if it is not correct.

EDIT: A better example (in terms of source code) is provided here.

EDIT 2: It seems that this problem is a "classic" one after all, and the solution is also described in the "Total Sort" section of Tom White's book "Hadoop: The Definitive Guide" (page 223 of the 1st edition). You can also follow this link for a free preview.

vefthym

You need a two-job MR approach:

First job:

Do your described logic in the mapper to get grouped counts in the reducer. The reducer then emits the count as the key and the original key+value as the value. The reducer here can be parallelized in case you hit performance problems.

Second job:

The mapper just maps the identity. Take care to sort in descending order by defining an inverted comparator (see the sketch below).

The single reducer here gets the data sorted in descending order. Then you can simply count up until you hit "k" and emit the values.

Note that you may have items with the same count, so every value you get from the reduced values needs to be counted towards "k" as well.
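As a rough sketch of that idea (my own code, assuming IntWritable counts and an invented topk.k configuration property), the inverted comparator and the reducer of the second job could look roughly like this:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Reducer;

// Sort comparator that inverts the natural IntWritable ordering, so the
// highest counts reach the reducer first.
public class DescendingIntComparator extends WritableComparator {
    public DescendingIntComparator() {
        super(IntWritable.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b); // reverse the ascending order
    }
}

// Single reducer that emits values until k of them have been written.
class TopKReducer extends Reducer<IntWritable, Text, Text, IntWritable> {
    private long k;
    private long emitted = 0;

    @Override
    protected void setup(Context context) {
        k = context.getConfiguration().getLong("topk.k", Long.MAX_VALUE);
    }

    @Override
    protected void reduce(IntWritable count, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            if (emitted++ >= k) {
                return; // we already have k elements; ignore the rest
            }
            context.write(value, count);
        }
    }
}

// In the driver of the second job:
//   job.setSortComparatorClass(DescendingIntComparator.class);
//   job.getConfiguration().setLong("topk.k", k);
//   job.setNumReduceTasks(1);
```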

Thomas Jungblut
  • Didn't quite get the first job. I already have the values, based on which I should sort the data, to get the top K. What do you mean "get the grouped counts"? Can you elaborate more on the first job please? Do you mean count the number of elements that have a specific value? – vefthym Jul 11 '14 at 11:02
  • @vefthym If you already have the key -> count mappings, ignore the first mapper. Just invert this mapping to be a count -> key+value mapping – Thomas Jungblut Jul 11 '14 at 11:13
  • My problem is that a single reducer does not fit all the data, so while it copies the data from all the mappers, it will run out of memory. I think that I should use something like a TotalOrderPartitioner. Do you agree? – vefthym Jul 12 '14 at 17:43

This may not be the most efficient, but it is simple to understand and easy to implement.

MapReduce Stage-1: Set the number of reducers to 1.

  • Map: Read the input (key, value) pairs and send them to the reducer with the key and value swapped, so that the score becomes the key (see the mapper sketch below).
  • Reduce: The shuffle stage will sort the data by the numeric scores (since they are now the keys) as it is sent across the network. The data will arrive at the single reducer, which will output one file in sorted order.

MapReduce Stage-2: No reduce phase required.

  • Map: Read the single, sorted file and output the top k elements.
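A minimal sketch of the Stage-1 mapper (the class and field names are mine), assuming text input in the question's "e4: 8" format:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Swaps the pair so that the numeric score becomes the key; the shuffle then
// sorts the records by score on their way to the single reducer.
public class ScoreAsKeyMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final IntWritable score = new IntWritable();
    private final Text element = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Lines are assumed to look like "e4: 8".
        String[] parts = line.toString().split(":");
        element.set(parts[0].trim());
        score.set(Integer.parseInt(parts[1].trim()));
        context.write(score, element); // score becomes the key
    }
}
```

Note that IntWritable keys sort in ascending order by default, so the highest scores end up at the end of the sorted file that Stage-2 reads.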

If you want to select the top k where k is a percentage, you can use a Hadoop counter during the Stage-1 map phase to count how many records exist in the input file, and then use another counter during Stage-2 to select the top k percent (a rough sketch of this wiring follows below).
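One possible wiring of the counter idea, as a sketch only (the enum and the topk.k property name are invented here): increment a custom counter once per record in the Stage-1 mapper, read it in the driver once Stage-1 has finished, and pass the computed k to Stage-2 through the configuration.

```java
import org.apache.hadoop.mapreduce.Job;

public class TopKPercentDriver {

    // Incremented once per record in the Stage-1 mapper, e.g.:
    //   context.getCounter(TopKCounters.TOTAL_RECORDS).increment(1);
    public enum TopKCounters { TOTAL_RECORDS }

    // Call after stage1.waitForCompletion(true) has returned successfully.
    static void configureStage2(Job stage1, Job stage2, double percent) throws Exception {
        long total = stage1.getCounters()
                .findCounter(TopKCounters.TOTAL_RECORDS).getValue();
        long k = (long) (total * percent); // e.g. percent = 0.75 for 3/4
        stage2.getConfiguration().setLong("topk.k", k);
    }
}
```

The Stage-2 mapper can then read topk.k from its configuration and stop emitting once it has written that many records from the sorted file.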

ahaque