Efficiently get the top-k words of a document from a sorted list of words

Question

I have a list of words, ordered in ascending frequency (df) for a document corpus.

From each document, I want to extract only the top-k words according to this ranked list (i.e. the k most infrequent words of the document, based on the corpus-wide counts).

What is the most efficient way to implement this (preferably in Java)?

A naive implementation is:

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public List<String> getTopK(List<String> wordCounts, Set<String> document, int k) {
    // wordCounts holds the corpus words in ascending order of frequency (df)
    List<String> topK = new ArrayList<>(); // the list to be returned
    for (String topWord : wordCounts) {
        if (topK.size() >= k) { // stop once k words are collected (also handles k <= 0)
            break;
        }
        if (document.contains(topWord)) { // assume HashSet --> O(1) for contains
            topK.add(topWord); // amortized O(1) for add
        }
    }
    return topK;
}

I have to do this for each of the D documents, and in the worst case the loop scans all W words for a single document. The total complexity is therefore O(D*W), which is too expensive (both D and W are on the order of millions).

Are you interested in every word, including the often-ignored ones like "the", "a", "and", "is" etc.? Have you considered using an API like `lucene` or something similar to help you? — Eypros, Jul 07 '14 at 12:26
I consider every word, including stopwords. I have not considered using lucene. I don't know if/how it could help, as I am implementing this in MapReduce. But anyway, I would like to see the solution, even if it is already implemented somewhere. — vefthym, Jul 07 '14 at 12:28
Do you want the "k most infrequent words" of a document based on all counts or a "k most infrequent word" per document? Anyway, this is rather easy to implement in MapReduce, maybe you should tell us where you are stuck. — Thomas Jungblut, Jul 07 '14 at 12:31
I want the k most infrequent words of a document, based on all counts. I have already implemented this in MapReduce (a first M/R job counts the word frequencies and a second job does what I described in the question, i.e. extracts the most infrequent words). I just wonder if it's the optimal solution. — vefthym, Jul 07 '14 at 12:35
Not sure I understood the problem, but I don't see how it could be better. Basically what you call DW is the size of the input (you read each word and choose in O(1) whether to take it or not). So the asymptotic complexity is the same as the one needed to read the data. And I don't see how you could solve the problem without reading all the input (data). — user189, Jul 07 '14 at 12:41
Your document is a set of strings? (In a set, each item is contained only once.) And there is no predefined order? — Willem Van Onsem, Jul 07 '14 at 12:41
It could be improved by using a more efficient string search algorithm than the contains() - which uses indexOf()... Maybe try using StringSearch http://johannburkard.de/software/stringsearch/ — rob, Jul 07 '14 at 12:42
@CommuSoft good point. I am interested in whether a document contains a word or not, that's why I use a set of words to represent it. — vefthym, Jul 07 '14 at 12:44

score 2 · Accepted Answer · answered Jul 07 '14 at 13:30

Instead of storing the words in a list sorted by df, use a map word -> df. Then, for each document, go through only the words of that document, look up each one's df in the map, and keep the k words with the lowest df. With this approach the complexity is O(D*w), where w is the number of words in one document, which is far smaller than W, the number of words in all documents.

Since O(D*w) is the size (in words) of the corpus, you can't do better.
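
A minimal sketch of this idea, not necessarily the exact code the answer had in mind. The class name `TopKByDf`, the parameter name `dfByWord`, and the choice to skip words missing from the map are assumptions made for illustration. It keeps the k rarest words in a bounded max-heap, so the cost per document is O(w log k) rather than strictly O(w):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical helper class; the name is not from the original post.
final class TopKByDf {

    /**
     * Returns up to k words of the document with the lowest df, rarest first.
     * Assumption: words that are absent from dfByWord are skipped.
     */
    static List<String> getTopK(Map<String, Integer> dfByWord, Set<String> document, int k) {
        List<String> topK = new ArrayList<>();
        if (k <= 0) {
            return topK;
        }
        Comparator<String> byDf = Comparator.comparingInt((String w) -> dfByWord.get(w));
        // Max-heap on df: the head is the most frequent word kept so far.
        PriorityQueue<String> heap = new PriorityQueue<>(byDf.reversed());
        for (String word : document) { // touches only the words of this document
            if (!dfByWord.containsKey(word)) {
                continue;
            }
            heap.add(word);
            if (heap.size() > k) {
                heap.poll(); // evict the most frequent of the k + 1 candidates
            }
        }
        topK.addAll(heap);
        topK.sort(byDf); // order the survivors by ascending df
        return topK;
    }
}
```

Words with equal df come back in no particular order relative to each other. If k is close to w, sorting the document's words by df is simpler and costs about the same.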

answered Jul 07 '14 at 13:30

Thomash

You are right, very good idea! I didn't actually keep the df of words. I just used it to sort the words and then I only keep the words in memory, without the df, caring only for their order. Alternatively I could keep as the values of the map the relative position (rank) of the words in the sorted list. – vefthym Jul 07 '14 at 13:57
It's way faster now! I used this post to sort each document map (I created one map for each document), based on the values (DFs): http://stackoverflow.com/a/2581754/2516301. – vefthym Jul 07 '14 at 14:57
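
The rank-based variant mentioned in the comments above could look roughly like the following. This is a sketch under assumptions: the names `TopKByRank`, `buildRanks` and `rankByWord` are hypothetical, and the per-document step sorts the document's known words by rank instead of reproducing the linked sort-by-value code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class; the names are not from the original post.
final class TopKByRank {

    /** Maps each word to its position in the list sorted by ascending df. */
    static Map<String, Integer> buildRanks(List<String> wordsByAscendingDf) {
        Map<String, Integer> rankByWord = new HashMap<>();
        int rank = 0;
        for (String word : wordsByAscendingDf) {
            rankByWord.put(word, rank++);
        }
        return rankByWord;
    }

    /** Returns up to k words of the document with the smallest rank, rarest first. */
    static List<String> getTopK(Map<String, Integer> rankByWord, Set<String> document, int k) {
        List<String> known = new ArrayList<>();
        for (String word : document) {
            if (rankByWord.containsKey(word)) { // assumption: unranked words are skipped
                known.add(word);
            }
        }
        // O(w log w) per document; the ranks are unique, so the order is deterministic.
        known.sort(Comparator.comparingInt((String w) -> rankByWord.get(w)));
        int limit = Math.min(Math.max(k, 0), known.size());
        return new ArrayList<>(known.subList(0, limit));
    }
}
```

The rank map is built once from the sorted list and then shared by all documents, so the df values themselves never need to be kept in memory.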
