I have a fairly small list of words, about 1000 or so. I want to check whether any of the words in that list occur in an input text and, if so, which ones. Each input text is a few hundred words long. They are paragraphs taken from the web, so there are a lot of them, from many different sites. I am trying to find the best algorithm for this.
I can see two obvious ways to do this:
1. Brute force: search the text for each word from the list in turn.
2. Build a hash table of the words in the input text, then look up each word from the list in it. Each lookup takes constant time on average, so this is fast.
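To make the two options concrete, here is a minimal sketch of what I have in mind. The function names and the tokenizing regex are just my assumptions for illustration (lowercased, whole-word matching), not a settled design:

```python
import re

def tokenize(text):
    """Lowercase the text and return its distinct words as a set."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def find_brute_force(word_list, text):
    """Option 1: scan the whole text once for each word in the list."""
    lowered = text.lower()
    return {w for w in word_list
            if re.search(r"\b" + re.escape(w.lower()) + r"\b", lowered)}

def find_with_hash(word_list, text):
    """Option 2: hash the words of the text once, then look up each list word."""
    text_words = tokenize(text)  # a Python set is a hash table
    return {w for w in word_list if w.lower() in text_words}
```

Both functions return the subset of the list that occurs in the text. Option 1 rescans the text for every word, while option 2 reads the text once and then does one hash lookup per word.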
Is there a better solution?
I am using Python, though I am not sure whether that changes the algorithm.
Also, as an optimization to solution 2 above, I would like to store the generated hash table in persistent storage (a DB), so that if the list of words changes I can reuse the hash table without having to build it again. Of course, if the input text changes I have to regenerate the hash table. Is it possible to save a hash table to a DB? Any recommendations? I am currently using MongoDB for my project, and I can only store JSON documents in it. I am new to MongoDB, have only just started working with it, and do not yet understand its full potential.
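To make the storage idea concrete, here is a minimal sketch, assuming each text is stored as one JSON document that holds its distinct words as a sorted list (a Python set is not JSON-serializable). The field names `_id` and `words`, the helper names, and the id `para-1` are made up for illustration. The JSON round trip only stands in for writing to and reading from the database, which is not shown:

```python
import json
import re

def text_to_document(text_id, text):
    """Build a JSON-friendly document holding the distinct words of one text."""
    words = sorted(set(re.findall(r"[a-z0-9']+", text.lower())))
    return {"_id": text_id, "words": words}  # hypothetical document layout

def match_document(doc, word_list):
    """Rebuild the hash set from a stored document and look up each list word."""
    stored = set(doc["words"])
    return {w for w in word_list if w.lower() in stored}

doc = text_to_document("para-1", "I am using Python with MongoDB.")
restored = json.loads(json.dumps(doc))  # stand-in for a save and a later load
```

With this layout, a changed word list would only require calling `match_document` again on each stored document, without re-tokenizing the original text.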
I have searched SO and found two questions along similar lines. One of them suggests a hash table, but I would like some pointers on the optimization I have in mind.
Here are the previously asked questions on SO:
Is there an efficient algorithm to perform inverted full text search?
Searching a large list of words in another large list
EDIT: I just found another question on SO which is about the same problem.
Algorithm for multiple word matching in text
I guess there is no better solution than a hash table. But I would really like to optimize it so that changes to the word list can let me run the algorithm on all the text I have stored up quickly. Should I change the tags added to the question to also include some database technologies?