16

I have a large set of words (about 10,000) and I need to find if any of those words appear in a given block of text.

Is there a faster algorithm than doing a simple text search for each of the words in the block of text?

Enrico Detoma

6 Answers

17

Put the 10,000 words into a hash table, then check whether each word in the block of text has an entry in the table.

Whether it's faster, I don't know; it's just another method (it would depend on how many words you are searching for).

A simple Perl example:

use strict;
use warnings;

my $word_block = "the guy went afk after being popped by a brownrabbit";
my @words = split /\s+/, $word_block;

# Load the word list (read here from the DATA section) into a hash.
my %hash;
while (<DATA>) { chomp; $hash{$_} = 1; }

# One hash lookup per word in the text block.
foreach my $word (@words) {
    print "found word: $word\n" if exists $hash{$word};
}

__DATA__
afk
lol
brownrabbit
popped
garbage
trash
sitdown
user105033
  • I was going to recommend KMP, but this is EXACTLY the solution that is required. +1 and should get the answer check. – samoz Jul 08 '09 at 19:13
  • Yeah this is about as good as you are going to get... O(N) time (assuming a good hash function of course) – DigitalZebra Jul 08 '09 at 19:21
  • @Polaris: Also assuming that the hash of words fits into memory (which, if it's 10k words, should be no problem. Just being pedantic) – llimllib Jul 08 '09 at 19:30
  • If the "given block of text" is small, and the 10,000 words are sorted, then it might (might) be faster not to bother with the hashtable, just binary chop (even in a newline-separated list). O(M log N) instead of O(M+N). But for general purposes, it has to be either this hashtable, or something fancy like a trie. – Steve Jessop Jul 08 '09 at 19:35
  • Why a hash and not just a linked list, for example? – Dervin Thunk Jul 08 '09 at 19:51
  • @onbyeone I actually did the math, and O(M log n) is faster for around M < 1000, though this doesn't take into account any constant factors. – samoz Jul 08 '09 at 19:55
  • @Dervin: Because looking things up in a linked list is incredibly slow. – Steve Jessop Jul 08 '09 at 20:31
  • @Dervin, linked lists still have to be searched through. With a hash you just hash the key and voila you have access to the data with that key. – user105033 Jul 08 '09 at 20:31
  • You are right, but I wasn't precise (I should probably open another question): my 10,000 "words" should have been "strings", because they are not always single words but two-, three-, or even four-word strings... – Enrico Detoma Jul 08 '09 at 20:58
13

Try out the Aho-Corasick algorithm: http://en.wikipedia.org/wiki/Aho-Corasick_algorithm
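
Roughly, Aho-Corasick builds a trie of all the patterns, adds "failure" links between trie nodes, and then scans the text once, so the matching work grows with the text length plus the number of matches rather than with the number of patterns. Below is a minimal pure-Perl sketch of the algorithm (the pattern list and the text are invented examples); for real use, an existing implementation such as the Java one linked in the comments below is probably preferable:

use strict;
use warnings;

# Build the goto trie from the pattern list.
sub build_trie {
    my @patterns = @_;
    my $root = { children => {}, out => [] };
    for my $pat (@patterns) {
        my $node = $root;
        for my $ch (split //, $pat) {
            $node->{children}{$ch} ||= { children => {}, out => [] };
            $node = $node->{children}{$ch};
        }
        push @{ $node->{out} }, $pat;    # a complete pattern ends at this node
    }
    return $root;
}

# Add failure links with a breadth-first pass over the trie.
sub add_failure_links {
    my ($root) = @_;
    my @queue;
    for my $child (values %{ $root->{children} }) {
        $child->{fail} = $root;
        push @queue, $child;
    }
    while (my $node = shift @queue) {
        for my $ch (keys %{ $node->{children} }) {
            my $child = $node->{children}{$ch};
            my $fail  = $node->{fail};
            $fail = $fail->{fail} while $fail != $root && !exists $fail->{children}{$ch};
            $child->{fail} = exists $fail->{children}{$ch} ? $fail->{children}{$ch} : $root;
            # Matches reachable through the failure link also end at this node.
            push @{ $child->{out} }, @{ $child->{fail}{out} };
            push @queue, $child;
        }
    }
}

# Scan the text once, reporting every pattern and the offset where it ends.
sub find_all {
    my ($root, $text) = @_;
    my ($node, $pos, @matches) = ($root, 0);
    for my $ch (split //, $text) {
        $pos++;
        $node = $node->{fail} while $node != $root && !exists $node->{children}{$ch};
        $node = $node->{children}{$ch} if exists $node->{children}{$ch};
        push @matches, [ $_, $pos ] for @{ $node->{out} };
    }
    return @matches;
}

# Multi-word patterns (strings containing spaces) need no special handling.
my @patterns = ('afk', 'brownrabbit', 'popped by');
my $trie = build_trie(@patterns);
add_failure_links($trie);

my $text = 'the guy went afk after being popped by a brownrabbit';
printf "matched '%s' ending at offset %d\n", @$_ for find_all($trie, $text);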

Cuga
  • +1 because that should actually solve my real problem (I said "words" but I should have said "strings", because they can also be two-, three- or four-word strings) – Enrico Detoma Jul 08 '09 at 21:01
  • Would the Aho-Corasick be efficient enough for such a large set of strings? I found an implementation in Java http://hkn.eecs.berkeley.edu/~dyoo/java/index.html which I could probably use. – Enrico Detoma Jul 08 '09 at 21:06
  • I believe it is, based on all the articles I was able to find related to it online. – Cuga Jul 09 '09 at 18:57
  • If this solves the "real" problem, then accept Cuga's answer. Aho-Corasick is a classy beast. It's especially useful in your case because of the spaces in the strings in the search dictionary (the set of strings to search for). For example, with user105033's method (hashing), you can't just check each word; rather, you have to check each word /and/ each consecutive 2, 3, 4, ... words, etc. In contrast, with the state machine approach, this is done implicitly. – user359996 Jun 24 '10 at 04:58
5

Build up a trie of your words, and then use that to find which words are in the text.
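
As a rough illustration, a trie can be built from nested Perl hashes (the word list and the text block below reuse the made-up examples from the answer above); each word of the text is then checked by walking down the trie one character at a time:

use strict;
use warnings;

# Build the trie as nested hashes; the key "\0" marks the end of a word.
my %trie;
for my $word (qw(afk lol brownrabbit popped garbage trash sitdown)) {
    my $node = \%trie;
    $node = $node->{$_} ||= {} for split //, $word;
    $node->{"\0"} = 1;
}

# Check each word of the text block with one walk down the trie.
my $text = "the guy went afk after being popped by a brownrabbit";
for my $word (split /\s+/, $text) {
    my $node = \%trie;
    for my $ch (split //, $word) {
        $node = $node->{$ch} or last;    # dead end: the word is not in the trie
    }
    print "found word: $word\n" if $node && $node->{"\0"};
}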

FryGuy
4

The answer heavily depends on the actual requirements.

  1. How large is the word list?
  2. How large is the text block?
  3. How many text blocks must be processed?
  4. How often must each text block be processed?
  5. Do the text blocks or the word list change? If so, how frequently?

Assuming relatively small text blocks compared to the word list, and that each text block is processed only once, I suggest putting the words from the word list into a hash table. Then you can perform a hash lookup for each word in the text block and find out whether the word list contains it.

If you have to process the text blocks multiple times, I suggest inverting the text blocks: for each word, create a list of all the text blocks that contain it.
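
A rough sketch of such an inverted index in Perl (the block ids and texts below are invented examples); after the one-time inversion, each word from the word list costs a single hash lookup across all blocks:

use strict;
use warnings;

# Invented example data: block id => block text.
my %blocks = (
    1 => 'the guy went afk after being popped',
    2 => 'please sit down and relax',
    3 => 'the brownrabbit went afk again',
);

# Invert the blocks: for each word, record the ids of the blocks containing it.
my %inverted;
while (my ($id, $text) = each %blocks) {
    my %seen;
    for my $word (split /\s+/, lc $text) {
        push @{ $inverted{$word} }, $id unless $seen{$word}++;
    }
}

# Each word from the word list is now answered with a single hash lookup.
for my $word (qw(afk brownrabbit garbage)) {
    my @ids = @{ $inverted{$word} || [] };
    printf "'%s' appears in block(s): %s\n", $word, @ids ? join(', ', @ids) : '(none)';
}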

In still other situations it might be helpful to generate a bit vector for each text block, with one bit per word indicating whether the word is contained in the text block.

Daniel Brückner
1

You can build a graph and use it as a state machine: when you process the i-th character of your input word, Ci, you try to move to the i-th level of your graph by checking whether the previous node, reached via Ci-1, has a child node for Ci.

For example, if you have the following words in your corpus
("art", "are", "be", "bee")
you will have the following nodes in your graph:
n11 = 'a'
n21 = 'r'
n11.sons = (n21)
n31 = 'e'
n32 = 't'
n21.sons = (n31, n32)
n41 = 'art' (here we have a leaf of the graph, and the word built from all the nodes above is associated with it)
n32.sons = (n41)
n42 = 'are' (here again we have a complete word)
n31.sons = (n42)
n12 = 'b'
n22 = 'e'
n12.sons = (n22)
n33 = 'e'
n34 = 'be' (word)
n22.sons = (n33, n34)
n43 = 'bee' (word)
n33.sons = (n43)

During processing, if you reach a leaf exactly while consuming the last character of your input word, and only in that case, the input word is in your corpus.

This method is more complicated to implement than a single dictionary or hash table, but it can be much better optimized in terms of memory use.

pierroz
0

The Boyer-Moore string-search algorithm should work. Depending on the size/number of words in the block of text, you might want to use the block's words as the keys for searching the word list instead (are there more words in the list than in the block?). Also, you probably want to remove any duplicates from both lists.
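
Full Boyer-Moore also needs a good-suffix table; as a rough illustration, here is the simpler Boyer-Moore-Horspool variant in Perl (the text and the searched-for words below are made-up examples). Note that this is still one scan of the text per word, so for 10,000 words the hash table or Aho-Corasick approaches above should scale better:

use strict;
use warnings;

# Boyer-Moore-Horspool: returns the 0-based offset of $pattern in $text, or -1.
sub horspool_index {
    my ($text, $pattern) = @_;
    my $n = length $text;
    my $m = length $pattern;
    return -1 if $m == 0 || $m > $n;

    # Bad-character table: how far to shift when the window's last character
    # is a given character of the pattern (its last occurrence wins).
    my %shift;
    $shift{ substr($pattern, $_, 1) } = $m - 1 - $_ for 0 .. $m - 2;

    my $pos = 0;
    while ($pos <= $n - $m) {
        # The eq stands in for Horspool's right-to-left character comparison.
        return $pos if substr($text, $pos, $m) eq $pattern;
        my $last = substr($text, $pos + $m - 1, 1);
        $pos += exists $shift{$last} ? $shift{$last} : $m;
    }
    return -1;
}

my $text = 'the guy went afk after being popped by a brownrabbit';
for my $word (qw(brownrabbit sitdown)) {
    my $at = horspool_index($text, $word);
    print $at >= 0 ? "'$word' found at offset $at\n" : "'$word' not found\n";
}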

meade