
I have a dictionary which contains a large number of strings. Each string is 1 to 4 tokens (words) long. For example:

Dictionary:

  1. The Shawshank Redemption
  2. The Godfather
  3. Pulp Fiction
  4. The Dark Knight
  5. Fight Club

Now I have a paragraph, and I need to figure out how many strings in it are part of the dictionary. For example, when the paragraph below:

**The Shawshank Redemption** considered the greatest movie ever made according to the IMDB Top 250.For at least the year or two that I have occasionally been checking in on the IMDB Top 250 **The Shawshank Redemption** has been battling **The Godfather** for the top spot.

is run against the dictionary, I should get the strings shown in bold as the ones that are part of the dictionary.

How can I do this with the fewest dictionary calls?

Thanks

sharath
  • How massive is the dictionary supposed to be? 'How can I do this with the fewest dictionary calls?' So it's about lookup complexity rather than memory usage, right? Any constraints regarding the programming language? – Constantin Dec 06 '13 at 23:03
  • How is the paragraph given, one long string or a file? – ChiefTwoPencils Dec 06 '13 at 23:05
  • The dictionary has a few million multi-token strings. Ya, lookup complexity is the major issue. No constraints regarding the programming language. The para would be a blob of text – sharath Dec 06 '13 at 23:14
  • The [Aho-Corasick string matching algorithm](http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm) is what you want. It builds a trie, and searches are incredibly efficient. It's interesting to note that I typed the question title here into Google and found http://en.wikipedia.org/wiki/String_searching_algorithm as the second search result, where clicking on "Algorithms using a finite set of patterns" got me to Aho-Corasick. Learning how to use a search engine can be very helpful. – Jim Mischel Dec 07 '13 at 00:48
  • How long is the expected length of your text? How long are the dictionary elements themselves? – ile Dec 07 '13 at 00:53
  • @Jim, with millions of words in the dictionary, I believe a normal Aho-Corasick will perform very poorly. It doesn't scale very well, to be honest, especially considering the alphabet size isn't really small in this case. – ile Dec 07 '13 at 00:55
  • @ile: I've built an Aho-Corasick tree with several million strings and used it to find song titles and band names in video metadata. It was *fast*. – Jim Mischel Dec 07 '13 at 01:03
  • @Jim, I understand it depends on the real-life factors. I worked on implementing real-time malware detection, where the main subproblem was to find multiple patterns in text fast. Although I admit that Aho-Corasick is very useful, it does *not* scale well. Out of curiosity, how big is the video metadata? Less than a couple of kilobytes? – ile Dec 07 '13 at 01:08
  • @ile: Yes, individual video metadata was smaller than a few kilobytes. The trie nodes were words rather than individual letters, which helped quite a bit. As described, the OP's problem is very similar to mine. – Jim Mischel Dec 07 '13 at 04:27
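
For reference, here is a minimal sketch of the Aho-Corasick approach the comments converge on. It uses the third-party pyahocorasick package; the package choice and all names below are assumptions for illustration, not something from the thread:

    import ahocorasick  # third-party: pip install pyahocorasick

    titles = ['The Shawshank Redemption', 'The Godfather', 'Pulp Fiction',
              'The Dark Knight', 'Fight Club']

    automaton = ahocorasick.Automaton()
    for title in titles:
        automaton.add_word(title, title)  # pattern -> value reported on match
    automaton.make_automaton()  # build the failure links once, then reuse

    text = ('The Shawshank Redemption has been battling The Godfather '
            'for the top spot.')

    # One pass over the text reports every dictionary string it contains;
    # the scan cost does not grow with the number of patterns, only with
    # the text length and the number of matches. Note the matches are raw
    # substring matches, so word boundaries may still need to be checked.
    for end_index, title in automaton.iter(text):
        print(title, 'ends at index', end_index)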

3 Answers


You might be better off using a Trie. A Trie is better suited to finding partial matches incrementally as you scan through the text of a paragraph, as opposed to making a bunch of dictionary calls that will mostly fail.

The reason why I think a Trie (or some variation) is appropriate is because it's built to do exactly what you're trying to do:

*(figure: a basic trie)*

If you use this (or some modification that keeps a tokenized word at each node instead of a letter), it would be the most efficient approach I know of in terms of both storage and retrieval. Storage, because instead of storing the word "The" a couple thousand times, once per dictionary entry whose title contains it (as is the case with movie titles), it is stored once in a node right under the root. The next word, "Shawshank", would be in a child node, and "Redemption" in the next, for a total of 3 lookups; then you move on to the next phrase. If the phrase fails, say it is actually "The Shawshank Looper", you fail after the same 3 lookups and restart at the word that failed, "Looper" (which, as it happens, would also be a child node under the root, so you would get a hit). This works assuming you're reading a paragraph without mashed-up movie names.
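
A minimal sketch of that word-level trie with a greedy longest-match scan (the helper names and the greedy policy are my own assumptions, not from the answer):

    def build_trie(titles):
        """Word-level trie: each node maps a token to a child node."""
        root = {}
        for title in titles:
            node = root
            for word in title.split():
                node = node.setdefault(word, {})
            node['$end$'] = True  # sentinel: the path from the root is a full title
        return root

    def find_titles(text, root):
        words = text.split()
        matches, i = [], 0
        while i < len(words):
            node, j, last = root, i, None
            # Walk down the trie for as long as the paragraph's words match,
            # remembering the longest complete title seen along the way.
            while j < len(words) and words[j] in node:
                node = node[words[j]]
                j += 1
                if '$end$' in node:
                    last = j
            if last is None:
                i += 1  # no title starts here; advance one word
            else:
                matches.append(' '.join(words[i:last]))
                i = last  # skip past the matched title
        return matches

A failed walk costs only as many lookups as the words it matched, which is the behavior described above.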

Using a hash table, you would have to split all the words, check the first word, and then, while there's no match, keep appending words and checking whether THAT phrase is in the dictionary, until you get a hit or reach the end of the paragraph. So if you hit a paragraph with no movie titles, you would make as many lookups as there are words in the paragraph.

BrDaHa
  • My dictionary is massive. It will be impossible to make a trie out of it. – sharath Dec 06 '13 at 23:01
  • If the author decides to go with a Trie, then perhaps optimizing it into a Suffix Tree would be a better idea? – ile Dec 06 '13 at 23:10
  • @paypalcomp: Impossible? A trie will ultimately result in less storage space, and they're easy to build as you are initializing the dictionary. BrDaHa isn't suggesting you have a dictionary of words AND a trie, but JUST a trie, which is great for exactly what you need. A suffix tree being a version of a trie makes it, too, a good idea. – AndyG Dec 06 '13 at 23:50
  • I agree, there are traditional space-efficient structures (e.g. Suffix Arrays, Suffix Automata) that can be used in O(n + km): construction is done in O(n), and each of the k pattern lookups is done in O(m). Obviously some heuristics can be used on top of this to speed it up. However, the most significant optimization, in my opinion, would be to filter the dictionary itself, i.e. consider only the top 1,000 elements instead of all 1,000,000. – ile Dec 06 '13 at 23:57
  • @ile: Done correctly, once the trie is constructed the search speed is not affected by the number of search strings, only by the length of the text being searched and the number of matches found. So filtering the dictionary is irrelevant. – Jim Mischel Dec 07 '13 at 01:00

This is not a complete answer, more of an extended comment.

In the literature this is called the "multi-pattern matching problem". Since you mentioned that the set of patterns has millions of elements, Trie-based solutions will most probably perform poorly.

As far as I know, in practice traditional string search is used with a lot of heuristics. DNA search, antivirus detection, and similar fields all need fast and reliable pattern matching, so there should be a decent amount of research done on this.

I can imagine how Rabin-Karp with rolling-hash functions and some filters (e.g. a Bloom filter) could be used to speed up the process. For example, instead of actually matching the substrings, you could first filter (e.g. with weak hashes) and only then verify, thus reducing the number of verifications needed. This should also reduce the work done against the original dictionary itself, as you would store its hashes, or other filters, instead.
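
A rough sketch of that filter-then-verify idea over word tokens, using a plain set of hashes as a stand-in for the Bloom filter and rehashing each window instead of rolling the hash (the 1-to-4-token window bound comes from the question; all names are illustrative):

    titles = {'The Shawshank Redemption', 'The Godfather', 'Pulp Fiction'}

    # First-stage filter: hashes of every title. At millions of entries, a
    # Bloom filter would give the same effect in far less memory.
    title_hashes = {hash(t) for t in titles}

    def find_matches(text, max_tokens=4):
        words = text.split()
        found = []
        for i in range(len(words)):
            # Titles are 1 to 4 tokens long, so only 4 windows start here.
            for w in range(1, max_tokens + 1):
                if i + w > len(words):
                    break
                phrase = ' '.join(words[i:i + w])
                # Run the expensive exact check only when the cheap hash
                # filter says "maybe".
                if hash(phrase) in title_hashes and phrase in titles:
                    found.append(phrase)
        return found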

ile

In Python:

    import re

    movies = {1: 'The Shawshank Redemption', 2: 'The Godfather', 3: 'Pretty Woman', 4: 'Pulp Fiction'}

    text = 'The Shawshank Redemption considered the greatest movie ever made according to the IMDB Top 250.For at least the year or two that I have occasionally been checking in on the IMDB Top 250 The Shawshank Redemption has been battling The Godfather for the top spot.'

    # Escape regex metacharacters in the titles, and try longer titles first
    # so that e.g. "The Godfather Part II" is matched before "The Godfather".
    titles = sorted(movies.values(), key=len, reverse=True)
    repl_str = '(?P<title>' + '|'.join(re.escape(title) for title in titles) + ')'

    result = re.sub(repl_str, r'<b>\g<title></b>', text)

Basically it consists of building one big alternation pattern out of your dict values. I don't know whether re.sub has a limit on the size of the pattern you give it, though. You might want to check.

lai
  • Just saw that your dictionary is massive. You may loop over it in chunks of 20 movies with the above approach. – lai Dec 06 '13 at 23:17
  • Another comment: you should sort your dict in descending alphabetical order before doing this. Otherwise you won't properly catch "The Godfather II". – lai Dec 06 '13 at 23:22
  • But given a string from the para, like "The Shawshank Redemption considered the..", you would have to make calls for all the 1, 2, 3 and 4 token combinations of the entire para (brute force). Was wondering if there is a better way out. – sharath Dec 06 '13 at 23:22
  • Example: if your para is "a b c d", the combinations for which you would make dictionary calls would be: a, b, c, d, ab, bc, cd, abc, bcd, abcd. – sharath Dec 06 '13 at 23:24
  • Not sure I understand your point. Did you try? On my test, I get this result: '<b>The Shawshank Redemption</b> considered the greatest movie ever made according to the IMDB Top 250.For at least the year or two that I have occasionally been checking in on the IMDB Top 250 <b>The Shawshank Redemption</b> has been battling <b>The Godfather</b> for the top spot.' – lai Dec 06 '13 at 23:24
  • Got you. But since I build my regex string with a pipe (OR), and not a succession, the issue you mention does not arise. Try with a string where order is different and you'll see... – lai Dec 06 '13 at 23:27
  • If the dictionary has millions of elements, the instruction will be a few megabytes big. I believe a regular regex will fail, or run slower than a simple O(n + km) search with KMP, Rabin-Karp, or a Suffix Tree's single-pattern match. – ile Dec 06 '13 at 23:31
  • E.g. a Trie (mentioned in another solution) constructed from the whole dictionary will be smaller than the instruction in this solution. – ile Dec 06 '13 at 23:33