
I have a list of sentences (e.g. "This is an example sentence") and a glossary of terms (e.g. "sentence", "example sentence"), and I need to find, for each sentence, all the terms that match it above a cutoff on some Levenshtein ratio.
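For concreteness, the matching criterion I have in mind looks like this (the 0.8 cutoff is only an example value):

```python
import Levenshtein  # python-Levenshtein

CUTOFF = 0.8  # example value; the real cutoff is tunable

# a term should match its span in the sentence even with a small typo:
Levenshtein.ratio("example sentence", "exmaple sentence") >= CUTOFF  # True (ratio ~0.94)
# unrelated terms fall well below the cutoff:
Levenshtein.ratio("example sentence", "glossary") >= CUTOFF          # False
```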

How can I do it fast enough? Splitting each sentence into words, using FTS to find the words that appear in terms, and then filtering those terms by ratio works, but it's quite slow. Right now I'm using sphinxsearch + python-Levenshtein; are there better tools?

Would the reverse search (FTS matching the terms inside the sentence) be faster?

x3al
  • *"How can I do it fast enough?"* - how fast is *"fast enough"*? *"Would the reverse search: FTS matching terms in sentence be faster?"* - why not try it and find out? – jonrsharpe Sep 08 '15 at 17:13
  • Faster than now cause it can take several seconds at this moment and want to do it at least twice faster. – x3al Sep 08 '15 at 17:17
  • *"Faster than now"* isn't at all helpful. *"at least twice faster"* is at least feasibly testable. – jonrsharpe Sep 08 '15 at 17:18

1 Answer


If speed is a real issue, and your glossary of terms is not going to be updated often compared to the number of searches you want to run, you could look into something like a Levenshtein automaton. I don't know of any Python libraries that support it, but if you really need it you could implement it yourself; finding all possible paths will require some dynamic programming.
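A minimal sketch of that idea, assuming the standard construction where the automaton state is one row of the edit-distance DP table, advanced one input character at a time (all names here are illustrative):

```python
class LevenshteinAutomaton:
    """Accepts strings within max_edits of term (DP-row construction)."""

    def __init__(self, term, max_edits):
        self.term = term
        self.max_edits = max_edits

    def start(self):
        # state for the empty input: distance i to each prefix of term
        return list(range(len(self.term) + 1))

    def step(self, state, char):
        # advance the DP row by one input character
        new_state = [state[0] + 1]
        for i, term_char in enumerate(self.term):
            cost = 0 if term_char == char else 1
            new_state.append(min(new_state[i] + 1,   # insertion
                                 state[i] + cost,    # substitution / match
                                 state[i + 1] + 1))  # deletion
        return new_state

    def is_match(self, state):
        return state[-1] <= self.max_edits

    def can_continue(self, state):
        # once every entry exceeds max_edits, no extension can ever match
        return min(state) <= self.max_edits


def within_edits(term, word, max_edits=1):
    # illustrative helper: is word within max_edits of term?
    automaton = LevenshteinAutomaton(term, max_edits)
    state = automaton.start()
    for char in word:
        state = automaton.step(state, char)
        if not automaton.can_continue(state):
            return False  # early exit: already too far away
    return automaton.is_match(state)
```

The win over a plain loop comes from walking the glossary (or the sentence's words) stored in a trie with these states: shared prefixes share work, and `can_continue` prunes whole subtrees at once.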

If you just need to get it done, loop over the glossary and test each term against each word in the string. That gives you an answer in polynomial time, and if you're on a multicore processor you might get some speedup by doing it in parallel.
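A minimal sketch of that loop, reusing python-Levenshtein from the question (the 0.8 cutoff and the n-word windowing for multi-word terms are illustrative choices):

```python
import Levenshtein  # python-Levenshtein, as in the question

def match_terms(sentence, glossary, cutoff=0.8):
    """Return glossary terms that fuzzily occur in sentence."""
    words = sentence.split()
    matches = []
    for term in glossary:
        n = len(term.split())
        # slide an n-word window over the sentence so multi-word terms
        # like "example sentence" are compared span-for-span
        windows = (" ".join(words[i:i + n])
                   for i in range(len(words) - n + 1))
        if any(Levenshtein.ratio(term, w) >= cutoff for w in windows):
            matches.append(term)
    return matches

print(match_terms("This is an example sentence",
                  ["sentence", "example sentence", "glossary"]))
# -> ['sentence', 'example sentence']
```

For the parallel version, `multiprocessing.Pool.map` over the list of sentences is the natural way to split the work.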

phsyron