
There might be better solutions but the two I first think of are:

1) For each word in the list, check if the text contains that word.

2) Store the words in a set. Store words from the text (anything separated by spaces; it doesn't have to be too accurate) in another set, and check whether the intersection of the two sets is empty.

I can't tell which would be better or if they're about the same.
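
For concreteness, here is a minimal Python sketch of option 2; the word list and text are made-up examples, and str.split() is the crude "separated by spaces" rule:

    # Option 2: split the text on whitespace and intersect with the word set.
    words = {"cat", "dog", "fish"}   # hypothetical word list
    text = "the quick brown fox jumps over the lazy dog"

    text_words = set(text.split())                 # crude tokenization
    has_match = not words.isdisjoint(text_words)   # True iff intersection is non-empty
    print(has_match)                               # True ("dog" occurs in the text)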

user1136342
  • The only correct answer: measure and compare the two implementations yourself. – Matt Ball Feb 04 '13 at 16:54
  • Calculating hashes of words and then comparing words with matching hashes may be faster. – Alexey Frunze Feb 04 '13 at 16:56
  • Is this language-specific, or do you need an algorithm? If an algorithm is required, check out Boyer–Moore and Rabin–Karp for searching for a word in a text – Michael Feb 04 '13 at 16:58
  • Additionally, your decision depends strongly on the relative size of the text – Michael Feb 04 '13 at 17:01
  • This might be of some value: http://stackoverflow.com/q/1099985/1236044 – jbl Feb 04 '13 at 17:05
  • It likely depends on the size of the text relative to the search list. In any event, I can't imagine an algorithm doing much better than (2). Depending on what you mean by (1), it could be significantly worse. – Patrick87 Feb 04 '13 at 17:13

3 Answers


This is the set matching problem.

Let S be a set of patterns, T your text, and n the number of elements of S found in T. Then you can find all occurrences of elements of S in the text in time O(|T| + |S| + n) [*] using the Aho–Corasick string matching algorithm.

Given that you just want to find the first occurrence, the execution time is reduced to O(|T| + |S|) in the worst case, which is linear in the length of the text if S is small enough!

[*] |S| is the total length of all the words in the set
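
For illustration, a compact Python sketch of the automaton described above; this is an illustrative implementation (production libraries add further optimizations such as goto-table compression):

    from collections import deque

    def first_match(patterns, text):
        """Return the first pattern found as a substring of text, or None."""
        # Build the trie: goto[i] maps a character to the next state,
        # fail[i] is the failure link, out[i] holds a pattern ending here.
        goto, fail, out = [{}], [0], [None]
        for pat in patterns:
            state = 0
            for ch in pat:
                if ch not in goto[state]:
                    goto.append({})
                    fail.append(0)
                    out.append(None)
                    goto[state][ch] = len(goto) - 1
                state = goto[state][ch]
            out[state] = pat

        # BFS to compute failure links; the root's children keep fail = 0.
        queue = deque(goto[0].values())
        while queue:
            u = queue.popleft()
            for ch, v in goto[u].items():
                queue.append(v)
                f = fail[u]
                while f and ch not in goto[f]:
                    f = fail[f]
                fail[v] = goto[f].get(ch, 0)
                if out[v] is None:          # inherit a match from the fail state
                    out[v] = out[fail[v]]

        # Scan the text once; report the first pattern that ends here.
        state = 0
        for ch in text:
            while state and ch not in goto[state]:
                state = fail[state]
            state = goto[state].get(ch, 0)
            if out[state] is not None:
                return out[state]
        return None

    print(first_match({"he", "she", "his"}, "ushers"))   # "she"

Note that this finds S-elements as substrings; if you need whole-word matches only, you would additionally check word boundaries at the reported positions.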

Haile

The most sophisticated implementations in Java, Python, and C++ do not use a single algorithm for this type of search.

Which algorithm to use is decided based on the text size, the search frequency, the distribution of words, etc. (multiple algorithms could also be used together).

If the text is large and you need to search for only a few words in it, most implementations use extended versions of the Boyer–Moore or Rabin–Karp algorithms.

An algorithm like Rabin–Karp, for example, searches for a hash match, and only if one is found does it compare the whole word; with a good rolling hash function this happens rarely.
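
For illustration, a minimal single-pattern Rabin–Karp sketch in Python; the base and modulus below are arbitrary illustrative choices:

    def rabin_karp(pattern, text, base=256, mod=1_000_003):
        """Return the index of the first occurrence of pattern in text, or -1."""
        m, n = len(pattern), len(text)
        if m > n:
            return -1
        high = pow(base, m - 1, mod)   # weight of the character being dropped
        p_hash = t_hash = 0
        for i in range(m):
            p_hash = (p_hash * base + ord(pattern[i])) % mod
            t_hash = (t_hash * base + ord(text[i])) % mod
        for i in range(n - m + 1):
            # Compare the actual characters only on a hash match;
            # with a good hash, spurious matches are rare.
            if t_hash == p_hash and text[i:i + m] == pattern:
                return i
            if i < n - m:
                # Roll the hash: drop text[i], append text[i + m].
                t_hash = ((t_hash - ord(text[i]) * high) * base
                          + ord(text[i + m])) % mod
        return -1

    print(rabin_karp("dog", "the lazy dog"))   # 9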

Storing a set of the text's words seems to be a better solution than your first suggestion, although storing hashed values of your words can be an even better solution (with an additional mapping between the hash values and the real words).

If your text has high distinctiveness, it will not matter much how you store the set. There are many more solutions than the ones you have suggested; I advise you to use Google.

Michael

Create a trie from one of the sets and look up every word of the second set in it. Taking the average length of a string as k, trie construction takes Θ(n*k) time, and checking whether a string belongs to the trie takes O(k).
For simplicity you can just consider the running time as O((n+m)*k). However, a more precise analysis gives Θ(n*k) + O(m*k), because you can actually finish long before scanning the whole second set. This shows that it's better to construct the trie from the smaller set and look up elements from the bigger one.
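
A minimal Python sketch of this approach (the dict-based trie and the "$" end-of-word marker are illustrative choices):

    def build_trie(words):
        """Build a trie as nested dicts in Theta(n*k) time."""
        root = {}
        for word in words:
            node = root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$"] = True          # end-of-word marker
        return root

    def contains(trie, word):
        """Check membership in O(k) for a word of length k."""
        node = trie
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

    def any_common(small_set, big_set):
        """Build the trie from the smaller set, probe with the bigger one;
        any() stops at the first hit, often before scanning all of it."""
        trie = build_trie(small_set)
        return any(contains(trie, w) for w in big_set)

    print(any_common({"cat", "dog"}, {"the", "lazy", "dog"}))   # True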

Grigor Gevorgyan