Algorithm to search for a list of words in a text

Question

I have a fairly small list of words, about 1,000 or so. I want to check whether any of the words in that list occur in an input text and, if so, which ones. The input texts are a few hundred words each; they are paragraphs from the web, meaning there are a lot of them from different sites. I am trying to find the best algorithm for this.

I can see two obvious ways to do this --

1. A brute-force search for each word from the list in the text.
2. Create a hash table of the words in the input text, then look up each word from the list in the hash table. This is fast.

Is there a better solution?

I am using Python, though I am not sure whether that changes the algorithm in any way.

Also, as an optimization to solution 2 above, I would like to store the generated hash table in persistent storage (a DB) so that if the list of words changes I can reuse the hash table without having to create it again. Of course, if the input text changes I have to regenerate the hash table. Is it possible to save a hash table to a DB? Any recommendations? I am currently using MongoDB for my project and I can only store JSON documents in it. I am new to MongoDB, have only just started working with it, and do not yet fully understand its potential.

I have searched SO and found two questions along similar lines. One of them suggests a hash table, but I would like pointers on the optimization I have in mind.

Here are the previously asked questions on SO -

Is there an efficient algorithm to perform inverted full text search?

Searching a large list of words in another large list

EDIT: I just found another question on SO which is about the same problem.

Algorithm for multiple word matching in text

I guess there is no better solution than a hash table. But I would really like to optimize it so that changes to the word list can let me run the algorithm on all the text I have stored up quickly. Should I change the tags added to the question to also include some database technologies?

*"Create a hash table of words from the input text and then search for each word from the list in the hash table. This is fast. Is there a better solution?"* What is wrong with this approach? Why aren't you satisfied with it? (Have you tried it?) — Ali, Jan 15 '14 at 00:41
That is the best solution I could think of. Am just trying to see if a better solution exists. I have tried it and so am thinking about the optimization I explained I want to add to it. Before I delve into the optimization, I want to make sure there isn't any other solution that I am not considering. — user220201, Jan 15 '14 at 00:44

Jim Mischel · Accepted Answer · 2014-01-15T03:58:57.230

There is a better solution than a hash table. If you have a fixed set of words that you want to search for over a large body of text, the way you do it is with the Aho-Corasick string matching algorithm.

The algorithm builds a state machine from the words you want to search, and then runs the input text through that state machine, outputting matches as they're found. Because it takes some amount of time to build the state machine, the algorithm is best suited for searching very large bodies of text.
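To make the state-machine idea concrete, here is a minimal Python sketch of the approach described above. It is an illustration under my own assumptions, not a tuned implementation: the names `build_automaton` and `find_all` are hypothetical, and matching is done on raw characters, so partial-word matches are reported.

```python
from collections import deque


def build_automaton(words):
    """Build trie transitions, failure links, and per-state output lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> words that end when this state is reached

    # Phase 1: insert every word into the trie.
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Phase 2: breadth-first pass to set failure links.
    # States directly below the root keep fail = 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the words recognised at the failure state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, word) for every occurrence of every word."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        # Follow failure links until a transition on ch exists (or we hit the root).
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i - len(word) + 1, word))
    return matches
```

The automaton is built once from the word list and can then be reused for every paragraph, for example `find_all(paragraph, automaton)`, which is where the up-front construction cost pays off.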

You can do something similar with regular expressions. For example, you might want to find the words "dog", "cat", "horse", and "skunk" in some text. You can build a regular expression:

"dog|cat|horse|skunk"

And then run a regular expression match on the text. How you get all matches will depend on your particular regular expression library, but it does work. For very large lists of words, you'll want to write code that reads the words and generates the regex, but it's not terribly difficult to do and it works quite well.
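As a rough sketch of generating the regex from a word list in Python: the function name `find_words` is hypothetical, and two details are my own additions rather than part of the answer. `re.escape` guards against words that contain regex metacharacters, and the words are sorted longest first because Python's `re` tries alternatives from left to right.

```python
import re


def find_words(words, text):
    """Return the set of listed words that occur anywhere in text."""
    # Longest first, so a longer word is preferred over its own prefix.
    ordered = sorted(words, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in ordered))
    return set(pattern.findall(text))


found = find_words(
    ["dog", "cat", "horse", "skunk"],
    "The dog chased the cat past a skunk and another dog.",
)
```

In practice the compiled pattern would be built once and reused across all input texts rather than recompiled on every call.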

There is a difference, though, between the results from a regex and the results from the Aho-Corasick algorithm. For example, suppose you're searching for the words "dog" and "dogma" in the string "My karma ate your dogma." A regex search reports a single match at that position: "dogma" with a leftmost-longest engine, or whichever alternative is listed first with a backtracking engine such as Python's re. The Aho-Corasick implementation reports both "dog" and "dogma" at the same position.

If you want the Aho-Corasick algorithm to report whole words only, you have to modify the algorithm slightly.

Regex, too, will report matches on partial words. That is, if you're searching for "dog", it will find it in "dogma". But you can modify the regex to match whole words only. Typically, that's done with the \b word-boundary anchor, as in:

"\b(cat|dog|horse|skunk)\b"
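In Python, the whole-word variant might look like the following sketch. The function name `find_whole_words` is hypothetical; the pattern is written as a raw string so that `\b` reaches the regex engine as a word boundary, and a non-capturing group `(?:...)` keeps `findall` returning the matched text.

```python
import re


def find_whole_words(words, text):
    """Return the set of listed words that occur as whole words in text."""
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return set(pattern.findall(text))


whole = find_whole_words(["cat", "dog"], "My karma ate your dogma, not my dog.")
```

Here "dogma" does not produce a match for "dog", because there is no word boundary between "g" and "m"; only the standalone "dog" at the end is reported.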

The algorithm you choose depends a lot on how large the input text is. If the input text isn't too large, you can create a hash table of the words you're looking for. Then go through the input text, breaking it into words, and checking the hash table to see if the word is in the table. In pseudo code:

hashTable = Build hash table from target words
for each word in input text
    if word in hashTable then
        output word

Or, if you want a list of matching words that are in the input text:

hashTable = Build hash table from target words
foundWords = empty hash table
for each word in input text
    if word in hashTable then
        add word to foundWords
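The pseudocode above translates fairly directly into Python, where a `set` plays the role of the hash table. This is a sketch under my own assumptions: the function name `matching_words` is hypothetical, and splitting the text with `\w+` after lowercasing is just one simple way to break it into words.

```python
import re


def matching_words(target_words, text):
    """Return the set of target words that appear as words in text."""
    targets = {w.lower() for w in target_words}   # the "hash table"
    found = set()
    # Assumed tokenization: lowercase, then runs of word characters.
    for word in re.findall(r"\w+", text.lower()):
        if word in targets:
            found.add(word)
    return found


result = matching_words(
    ["dog", "cat", "horse"],
    "The Dog saw a cat. The dog ran.",
)
```

Each lookup is constant time on average, so the cost is proportional to the number of words in the text, regardless of how many target words there are.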
