I have a big sequence file storing the tf-idf values for documents. Each line represents a document and the columns hold the tf-idf value of each term (so every row is a sparse vector). I'd like to pick the top-k terms for each document using Hadoop. The naive solution is to loop over all the columns of each row in the mapper and pick the top-k, but as the file grows bigger and bigger I don't think this is a good solution. Is there a better way to do this in Hadoop?
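For concreteness, here is roughly what I mean by the naive per-row loop, as a plain-Java sketch (the `TermScore` type and the sparse-row layout are made up for illustration):

```java
import java.util.PriorityQueue;

public class NaiveTopK {

    // One non-zero entry of a sparse row: a term index and its tf-idf weight.
    static class TermScore {
        final int termIndex;
        final double tfidf;
        TermScore(int termIndex, double tfidf) {
            this.termIndex = termIndex;
            this.tfidf = tfidf;
        }
    }

    // Scan one row and keep the k highest-scoring terms in a min-heap,
    // which is O(n log k) per row instead of sorting the whole row.
    static PriorityQueue<TermScore> topK(TermScore[] row, int k) {
        PriorityQueue<TermScore> heap =
                new PriorityQueue<>(k, (a, b) -> Double.compare(a.tfidf, b.tfidf));
        for (TermScore entry : row) {
            heap.offer(entry);
            if (heap.size() > k) heap.poll();   // drop the current weakest
        }
        return heap;                            // the k terms with the highest tf-idf
    }
}
```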
- Interesting question, upvoted. See whether the answer at http://stackoverflow.com/questions/9872099/find-the-largest-k-numbers-in-k-arrays-stored-across-k-machines can be used to get the top-k numbers; you can then recover the words from whatever {colID -> word} mapping you used to build the tf-idf vectors. – Aditya Jun 11 '15 at 07:17
- If it works, you may add an answer to this question yourself. – Aditya Jun 11 '15 at 07:18
- Can you give an example of the data? – sp_user123 Jun 23 '15 at 11:09
1 Answer
1. In every mapper, compute a local top K (the top K of that mapper's input split only).
2. Spawn a single reducer; the local top-K candidates from all mappers flow to this one reducer, which then evaluates the global top K. A minimal code sketch follows below.

Think of the problem as:

1. You have been given the results of X horse races.
2. You need to find the Top N fastest horses overall.

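A minimal sketch of that two-step pattern (not a drop-in solution): it assumes the term/score pairs have already been flattened to plain text lines of the form `term<TAB>tfidf`, and it reads K from a made-up configuration key `topk.k`; all class names here are illustrative.

```java
import java.io.IOException;
import java.util.AbstractMap;
import java.util.Map;
import java.util.PriorityQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GlobalTopK {

    /** Step 1: each mapper keeps only its local top K and emits it in cleanup(). */
    public static class TopKMapper
            extends Mapper<LongWritable, Text, DoubleWritable, Text> {
        private int k;
        // Min-heap on the score: the head is the weakest of the current top K.
        private PriorityQueue<Map.Entry<Double, String>> heap;

        @Override
        protected void setup(Context context) {
            k = context.getConfiguration().getInt("topk.k", 10);
            heap = new PriorityQueue<>(k, (a, b) -> Double.compare(a.getKey(), b.getKey()));
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            String[] parts = line.toString().split("\t");
            if (parts.length != 2) return;                // skip malformed lines
            heap.offer(new AbstractMap.SimpleEntry<>(Double.parseDouble(parts[1]), parts[0]));
            if (heap.size() > k) heap.poll();             // evict the smallest entry
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Only K candidates per mapper travel across the network.
            for (Map.Entry<Double, String> e : heap) {
                context.write(new DoubleWritable(e.getKey()), new Text(e.getValue()));
            }
        }
    }

    /** Step 2: a single reducer merges all local winners into the global top K. */
    public static class TopKReducer
            extends Reducer<DoubleWritable, Text, Text, DoubleWritable> {
        private int k;
        private PriorityQueue<Map.Entry<Double, String>> heap;

        @Override
        protected void setup(Context context) {
            k = context.getConfiguration().getInt("topk.k", 10);
            heap = new PriorityQueue<>(k, (a, b) -> Double.compare(a.getKey(), b.getKey()));
        }

        @Override
        protected void reduce(DoubleWritable score, Iterable<Text> terms, Context context) {
            for (Text term : terms) {
                heap.offer(new AbstractMap.SimpleEntry<>(score.get(), term.toString()));
                if (heap.size() > k) heap.poll();
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            for (Map.Entry<Double, String> e : heap) {
                context.write(new Text(e.getValue()), new DoubleWritable(e.getKey()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("topk.k", 10);                        // global K, illustrative default
        Job job = Job.getInstance(conf, "global top-k tf-idf terms");
        job.setJarByClass(GlobalTopK.class);
        job.setMapperClass(TopKMapper.class);
        job.setReducerClass(TopKReducer.class);
        job.setNumReduceTasks(1);                         // one reducer => global top K
        job.setMapOutputKeyClass(DoubleWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The two things that make this pattern work are emitting from `cleanup()` (so each mapper ships at most K records) and `job.setNumReduceTasks(1)`, which funnels every local candidate through one reducer so the final K is genuinely global.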
KrazyGautam