Finding popular keywords in huge list

Question

I have a huge list with about 100 000 lines like this:

ipadnews
abcipad
cddeeffipad
hellworld
iworldthis .. and so on

I would like to find popular substrings. In this case "ipad" would be the most popular and "world" would be in second place. The minimum length should be three or four characters.

I can't predict the substrings so using a dictionary is a no no.

what are keywords then? any substring of the words found in your list? it sure sounds like that... and it's a hell of a complexity — hummingBird, Nov 12 '10 at 20:04
How is the algorithm supposed to know that "world" is an acceptable keyword, but not "worl" or "orld"? — Axn, Nov 12 '10 at 21:01
@Wooble, @Playcat: Good point! Should be substring and not keyword. Thanks — Jonas Lejon, Nov 13 '10 at 13:32
@Axn: You don't know until you do the matching of popular substrings — Jonas Lejon, Nov 13 '10 at 13:32

score 4 · Answer 1 · answered Nov 12 '10 at 20:04

This is a relatively complicated problem, but it's tractable using prefix/suffix trees. It's essentially a variation of the longest common subsequence and longest common substring problems, which is where I would start.

There's actually quite a bit of research on problems of this form, so you should be able to use the terms above to narrow your search.
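
As a starting point for the longest common substring problem mentioned above, here is a minimal sketch of the textbook dynamic-programming solution for two strings. The function name `longest_common_substring` is only illustrative, and this pairwise version is a stepping stone rather than a solution for the whole 100 000-line list.

```python
def longest_common_substring(a, b):
    """Return one longest substring shared by a and b (empty string if none)."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common run ending at a[i-2] and b[j-1].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended one character earlier in both strings.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


print(longest_common_substring("hellworld", "iworldthis"))
```

This takes time proportional to len(a) * len(b) per pair, so running it over every pair of lines would be far too slow for a list this size. That is the reason tree-based structures come up.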

score 2 · Answer 2 · answered Nov 12 '10 at 20:07

You can solve this using a generalized suffix tree, which can be built in O(n) time, where n is the total length of all the strings. This is effectively a play on the longest common substring (LCS) problem.
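
A true linear-time generalized suffix tree (for example via Ukkonen's algorithm) is too long to show here, so the sketch below swaps in a simpler relative: a suffix trie capped at `max_len` characters, where each node counts how many distinct words contain its substring. The names `Node`, `build_trie`, and `popular_substrings`, and the defaults `max_len=8`, `min_len=3`, `min_words=2`, are assumptions for illustration only.

```python
class Node:
    __slots__ = ("children", "words", "last_word")

    def __init__(self):
        self.children = {}
        self.words = 0       # number of distinct input words containing this substring
        self.last_word = -1  # index of the most recent word that reached this node


def build_trie(words, max_len=8):
    """Insert every suffix of every word, truncated to max_len characters."""
    root = Node()
    for idx, word in enumerate(words):
        for start in range(len(word)):
            node = root
            for ch in word[start:start + max_len]:
                nxt = node.children.get(ch)
                if nxt is None:
                    nxt = node.children[ch] = Node()
                node = nxt
                if node.last_word != idx:  # count each word at most once per node
                    node.last_word = idx
                    node.words += 1
    return root


def popular_substrings(root, min_len=3, min_words=2):
    """Collect (word_count, substring) pairs, most frequent and longest first."""
    found = []
    stack = [(root, "")]
    while stack:
        node, text = stack.pop()
        for ch, child in node.children.items():
            # A longer substring never occurs in more words than its prefix,
            # so subtrees below the threshold can be skipped entirely.
            if child.words >= min_words:
                if len(text) + 1 >= min_len:
                    found.append((child.words, text + ch))
                stack.append((child, text + ch))
    return sorted(found, key=lambda t: (-t[0], -len(t[1]), t[1]))


words = ["ipadnews", "abcipad", "cddeeffipad", "hellworld", "iworldthis"]
for count, substring in popular_substrings(build_trie(words)):
    print(count, substring)
```

The cap keeps memory bounded at roughly max_len nodes per character of input, at the cost of never reporting substrings longer than `max_len`.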

answered by jason

score 0 · Answer 3 · answered Nov 12 '10 at 20:22

I would go about this problem using the following flow of logic:

1. Extract the set of suffixes for each word. So from 'ipadnews' we get: 'ipadnews', 'padnews', 'adnews', and so on. This way, 'news' will be one of the suffixes, but not 'ipad'.
2. To make up for the substrings missed in the step above, extract the prefixes as well. We get 'ipadnew', 'ipadne', and so on, including 'ipad'.
3. For each of the substrings above, increment its count in a hash table, e.g. $hash{$substr}++.
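
The steps above can be sketched in Python, with `collections.Counter` playing the role of `$hash{$substr}++`. The function name `prefix_suffix_counts` and the `min_len=3` default (taken from the question's minimum length) are illustrative, and the per-word set is an added assumption so that the full word, which is both a prefix and a suffix, is counted only once per word.

```python
from collections import Counter


def prefix_suffix_counts(words, min_len=3):
    """Count, for each prefix or suffix, how many words produce it."""
    counts = Counter()
    for word in words:
        pieces = set()
        for i in range(len(word) - min_len + 1):
            pieces.add(word[i:])               # suffix starting at position i
            pieces.add(word[:len(word) - i])   # prefix with i characters dropped
        counts.update(pieces)                  # same idea as $hash{$substr}++
    return counts


words = ["ipadnews", "abcipad", "cddeeffipad", "hellworld", "iworldthis"]
counts = prefix_suffix_counts(words)
print(counts["ipad"], counts["world"])
```

On the sample list this counts "ipad" three times but "world" only once, because "world" sits in the middle of "iworldthis" and is neither a prefix nor a suffix there.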

At the end we will have a large hash table with substrings as keys and their frequencies as values. Instead of an expensive sort, suppose you only want the 10 most popular substrings. Keep a set of at most 10 entries from the beginning, and keep track of the entry with the minimum score. When you come across an 11th item whose score is higher than that minimum, bump out the entry with the minimum score and update the minimum-score pointer.
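
One way to keep that bounded set is a min-heap, where the root is always the entry with the current minimum score. This is a sketch under that assumption, not necessarily the structure the answer had in mind, and `top_k` is a hypothetical name.

```python
import heapq


def top_k(counts, k=10):
    """Return the k highest-scoring (count, substring) pairs without a full sort."""
    heap = []  # min-heap; heap[0] is the entry with the current minimum score
    for substring, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, substring))
        elif count > heap[0][0]:
            # Bump out the current minimum and insert the new entry in one step.
            heapq.heapreplace(heap, (count, substring))
    return sorted(heap, reverse=True)


print(top_k({"ipad": 3, "world": 2, "pad": 3, "abc": 1}, k=2))
```

Each update costs O(log k), so the whole pass is O(m log k) for m distinct substrings. The standard library's `heapq.nlargest` packages the same idea if a hand-rolled loop is not needed.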

The maximum number of keys in the hash table will be 2*k*n, where k is the average length of the words and n is the total number of words, since each word contributes at most k suffixes and k prefixes.

This won't find "world" in "iworldthis", which OP seems to expect. May require extracting the set of prefixes from each element of the set of suffixes? — Wooble, Nov 13 '10 at 19:05
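
A minimal sketch of the fix suggested in this comment: taking every prefix of every suffix enumerates all substrings, so "world" inside "iworldthis" is found. The name `substring_counts` and the optional `max_len` cap are assumptions added for illustration.

```python
from collections import Counter


def substring_counts(words, min_len=3, max_len=None):
    """Count, for each substring, how many words contain it."""
    counts = Counter()
    for word in words:
        seen = set()  # de-duplicate within a word so each word counts once
        for start in range(len(word) - min_len + 1):
            stop = len(word) if max_len is None else min(len(word), start + max_len)
            for end in range(start + min_len, stop + 1):
                seen.add(word[start:end])  # a prefix of the suffix word[start:]
        counts.update(seen)
    return counts


words = ["ipadnews", "abcipad", "cddeeffipad", "hellworld", "iworldthis"]
counts = substring_counts(words)
print(counts["ipad"], counts["world"])
```

The price is that a word of length k now yields on the order of k*k substrings instead of 2*k, so a `max_len` cap may be worth setting for long lines. Note that "worl" and "orld" tie with "world" here; preferring the longest substring among equal counts is one possible way to break such ties.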
