How Many Top Words Do We Have to Select from Each Text File in Order to Get the Top k Words of the Corpus?

Question

I have a corpus containing 100 articles, each with a lot of words, so I want to count the words of each article independently, on different threads or in a distributed system.

So for each article I will get a list of words sorted by word frequency, something like (in C++):

//         count, word
vector<pair<int, string> > v0;
sort(v0.begin(), v0.end(), greater<pair<int, string> >()); // descending order by count

For the other 99 articles, we will get similar sorted results, v1, v2, ... v99

My question is: how do we merge these sorted results to get the top k (say 10) words in the corpus?

NOTE: this corpus might be in a distributed system, and we may not want to fetch all words from each list, so the question becomes: how many top words do we have to select from each article in order to get the top k=10 words of the entire corpus? In other words, can we discard any words from each article?
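
A minimal C++11 sketch of the merge step asked about above, assuming the full (count, word) list of every article can be gathered in one place. The names WordCounts, CorpusCounts, topKOfCorpus and keepFirst are made up for illustration and are not part of the question. The sketch sums the counts per word and then uses std::partial_sort to pull out the k largest totals.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using namespace std;

// (count, word) pairs, same layout as v0 in the question.
typedef vector<pair<int, string> > WordCounts;         // one article
typedef vector<pair<long long, string> > CorpusCounts; // summed over articles

// Hypothetical helper: add up the counts of every word across all
// articles, then return the k entries with the largest totals,
// in descending order.
CorpusCounts topKOfCorpus(const vector<WordCounts>& articles, size_t k) {
    unordered_map<string, long long> total;
    for (const WordCounts& article : articles)
        for (const pair<int, string>& entry : article)
            total[entry.second] += entry.first;

    CorpusCounts merged;
    merged.reserve(total.size());
    for (const auto& kv : total)
        merged.push_back(make_pair(kv.second, kv.first));

    // Only the first k positions need to end up sorted.
    k = min(k, merged.size());
    partial_sort(merged.begin(), merged.begin() + k, merged.end(),
                 greater<pair<long long, string> >());
    merged.resize(k);
    return merged;
}

// Hypothetical helper: keep only the first n entries of an
// already sorted per-article list.
WordCounts keepFirst(const WordCounts& sortedCounts, size_t n) {
    n = min(n, sortedCounts.size());
    return WordCounts(sortedCounts.begin(), sortedCounts.begin() + n);
}
```

Cutting each list down to its own top k before merging is not safe in general. Take k = 1 and two articles with counts {a: 10, c: 9} and {b: 10, c: 9}: merging the full lists gives c with a total of 18, but after keepFirst(..., 1) on each article c is gone and the merge returns a word with a count of 10.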

score 0 · Answer 1 · edited May 23 '17 at 12:20


Have a look at Tf-idf. There is also a similar question, which was answered a couple of years back.

edited May 23 '17 at 12:20

Community


answered Mar 06 '14 at 22:01

arcolife

