I have a list of words sorted in ascending order of document frequency (df) across a document corpus.
From each document, I want to extract only the top-k words according to this ranked list, i.e. the k words in the document that are least frequent across the whole corpus.
What is the most efficient way to implement this (preferably in Java)?
A naive implementation is:
    public List<String> getTopK(List<String> wordCounts, Set<String> document, int k) {
        // wordCounts holds the words in ascending order of frequency
        List<String> topK = new ArrayList<>(); // the list to be returned
        for (String topWord : wordCounts) {
            if (topK.size() >= k) { // stop once k words are collected (also covers k <= 0)
                break;
            }
            if (document.contains(topWord)) { // assume HashSet --> O(1) for contains
                topK.add(topWord); // amortized O(1) for add
            }
        }
        return topK;
    }
I have to do this for each of the D documents, and in the worst case each call scans all W words, so the total complexity is O(D*W), which is too expensive (both D and W are on the order of millions).
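
One possible direction, offered only as a sketch and not as a measured solution: invert the loop so that each document iterates over its own words instead of over the whole ranked list. This assumes that a typical document contains far fewer than W distinct words and that a word-to-rank map fits in memory. The class name `TopKRareWords` and the helper `buildRanks` are hypothetical names introduced for illustration. The rank map is built once in O(W), and each document then costs roughly O(|document| * log k) using a bounded heap.

```java
import java.util.*;

class TopKRareWords {

    // Hypothetical helper: maps each word to its index in the ascending-df list.
    // Built once for the whole corpus, O(W).
    static Map<String, Integer> buildRanks(List<String> wordCounts) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < wordCounts.size(); i++) {
            rank.put(wordCounts.get(i), i);
        }
        return rank;
    }

    // Returns up to k words of the document with the smallest rank
    // (i.e. the lowest corpus frequency), rarest first.
    static List<String> getTopK(Map<String, Integer> rank, Set<String> document, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Max-heap on rank: the head is the most frequent word kept so far.
        PriorityQueue<String> heap = new PriorityQueue<>(
                Comparator.comparing((String w) -> rank.get(w)).reversed());
        for (String word : document) {
            Integer r = rank.get(word);
            if (r == null) {
                continue; // assumption: words missing from the ranked list are ignored
            }
            if (heap.size() < k) {
                heap.add(word);
            } else if (r < rank.get(heap.peek())) {
                heap.poll();    // evict the most frequent of the kept words
                heap.add(word);
            }
        }
        List<String> topK = new ArrayList<>(heap);
        topK.sort(Comparator.comparing(rank::get)); // rarest first
        return topK;
    }
}
```

With this layout the total work would be about O(W + N * log k), where N is the sum of the document sizes, so whether it actually beats the naive scan depends on how large the documents are relative to W.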