
I have a huge set of keywords. Given a text, I want to recognize only those words that occur in my keyword list and ignore all the other words. What is the best way to approach this?

kc3

3 Answers


The Aho-Corasick algorithm is a fast algorithm for recognizing a set of pattern strings in a larger source string. It's employed by several search utilities, along with many antivirus programs, since it runs in time O(m + n + z), where n is the total size of all the pattern strings you're trying to match, m is the length of the string to search, and z is the total number of matches. Moreover, if you know in advance what strings you're searching for, you can do the O(n) work offline and reduce the search time to O(m + z).
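If you know the pattern set up front, a hand-rolled version is short enough to sketch. The following Python is illustrative only (the structure and names are mine, not the code any particular utility uses): `build_automaton` is the O(n) offline step and `find_matches` is the O(m + z) scan.

```python
from collections import deque

def build_automaton(patterns):
    # Each state: 'next' maps a character to a state id, 'fail' is the
    # failure link, and 'out' lists the patterns that end at this state.
    states = [{"next": {}, "fail": 0, "out": []}]

    # Phase 1 (the O(n) offline work): build a trie of the patterns.
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in states[s]["next"]:
                states.append({"next": {}, "fail": 0, "out": []})
                states[s]["next"][ch] = len(states) - 1
            s = states[s]["next"][ch]
        states[s]["out"].append(pat)

    # Phase 2: breadth-first pass to fill in the failure links.
    queue = deque(states[0]["next"].values())   # depth-1 states fail to root
    while queue:
        s = queue.popleft()
        for ch, t in states[s]["next"].items():
            queue.append(t)
            f = states[s]["fail"]
            while f and ch not in states[f]["next"]:
                f = states[f]["fail"]
            states[t]["fail"] = states[f]["next"].get(ch, 0)
            states[t]["out"] += states[states[t]["fail"]]["out"]
    return states

def find_matches(states, text):
    # The O(m + z) scan: a single pass over the text.
    s = 0
    for i, ch in enumerate(text):
        while s and ch not in states[s]["next"]:
            s = states[s]["fail"]
        s = states[s]["next"].get(ch, 0)
        for pat in states[s]["out"]:
            yield i - len(pat) + 1, pat         # (start index, pattern)

states = build_automaton(["he", "she", "his", "hers"])
print(list(find_matches(states, "ushers")))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```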

templatetypedef
  • There is a difference between strings and words. A key idea in that algorithm is that when you fail to match `foo`, it knows that you could be about to match `oof`. But that is not true if you're trying to match whole words. – btilly May 20 '11 at 16:55
  • This is a good point. You could store the strings with spaces before and after them (for example, "`HELLO`" stored as "` HELLO `"), or perhaps treat punctuation such as periods as boundaries as well. – templatetypedef May 20 '11 at 22:28
  • Actually silly error. When you fail to match `foo` you could be about to match `of` but *not* `oof`. Anyways it is easier just to skip the complications of that algorithm and just use a trie. – btilly May 20 '11 at 22:34

Store your words in a trie.

Walk your text. Every time you start a word, start walking the trie. If the word ends exactly where a word in the trie ends, it is one of the words you were interested in; otherwise it isn't.

You will have minor complications around the definition of a word. In particular, non-word characters usually end a word, but there are exceptions such as `don't`.
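For concreteness, here is a rough Python sketch of that walk. It is illustrative only: the helper names are mine, and the regex's definition of a word (letters plus apostrophes, so `don't` survives) is a simplification you would tune for your data.

```python
import re

def build_trie(words):
    # Nested dicts; the "$" key marks that a complete keyword ends here.
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def keywords_in(trie, text):
    found = []
    for word in re.findall(r"[A-Za-z']+", text):  # crude word definition
        node = trie
        for ch in word:
            node = node.get(ch)
            if node is None:
                break                # fell off the trie: not a keyword
        else:
            if "$" in node:          # ended exactly at a stored word
                found.append(word)
    return found

trie = build_trie(["foo", "bar", "don't"])
print(keywords_in(trie, "foo bar food don't"))   # ['foo', 'bar', "don't"]
```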

Note that some regular expression engines (Perl's, in any recent version, for one) are smart enough to automatically construct a trie from an alternation and match against it. So there is a good chance that you can just join your words together with pipes, throw the result at a regular expression engine, and get good performance.
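In Python terms the construction looks like this (illustration only: Python's `re` engine does not do Perl's trie optimization, but the joining step is the same):

```python
import re

keywords = ["foo", "bar", "baz", "blat"]
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b")
print(pattern.findall("foo the blat, but not food or barring"))  # ['foo', 'blat']
```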

If that does not work, you can construct a regular expression that encodes a trie yourself. For instance, given the list foo, bar, baz, blat, the regular expression /\b(foo|b(?:a(?:r|z)|lat))\b/ should match those words and only those words. It probably won't be as efficient as hand-rolled C (on Perl's engine, for instance, you'll run into its checks for slow-performing complex regular expressions, and it will likely do some backtracking that it didn't need to do), but it will be a lot less work to put together.
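If you'd rather generate that trie-shaped pattern mechanically than write it by hand, a small helper along these lines works; the function is my own sketch, not a standard library routine.

```python
import re

def trie_regex(words):
    # Build a plain trie first; an empty-string key marks end-of-word.
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = {}

    def emit(node):
        if "" in node and len(node) == 1:
            return ""                    # only an end-of-word marker here
        alts = [re.escape(ch) + emit(child)
                for ch, child in sorted(node.items()) if ch != ""]
        if "" in node:
            alts.append("")              # a shorter word also ends here
        return "(?:%s)" % "|".join(alts) if len(alts) > 1 else alts[0]

    return r"\b" + emit(trie) + r"\b"

print(trie_regex(["foo", "bar", "baz", "blat"]))
# \b(?:b(?:a(?:r|z)|lat)|foo)\b
```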

btilly
  • If my list of keywords is, say, around 10,000, is your method still efficient? – kc3 May 20 '11 at 17:52
  • @kc3: Yes. The effort to build a trie is roughly proportional to the total number of letters in all of your words. Once built, the time to match is roughly proportional to the size of your text. I say roughly because there are some implementation details around how you store the trie that can introduce various factors. – btilly May 20 '11 at 18:19
  1. Put your keywords into a data structure that allows easy lookup. For example, a hash table or binary tree. If you're hardcore, you can create a perfect hash from your keywords.
  2. Use a DFA to break the input into "words". This can be done with a regular expression library or a simple state machine.
  3. Look up each "word" to see if it's one of your keywords. (A sketch of all three steps follows.)
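A rough Python sketch of those three steps, with my own names and a simplistic token definition, and a plain set standing in for the hash table or perfect hash:

```python
import re

keywords = {"foo", "bar", "baz", "blat"}   # step 1: hash-based lookup structure
tokenizer = re.compile(r"[A-Za-z']+")      # step 2: a regex stands in for the DFA

def matching_words(text):
    # step 3: keep only the tokens that are keywords
    return [tok for tok in tokenizer.findall(text) if tok in keywords]

print(matching_words("foo, then blat; but not food"))   # ['foo', 'blat']
```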
Adrian McCarthy