
The use case is auto-complete options, where I want to rank a large set of other strings by how similar they are to a fixed string.

Is there any bastardization of something like a DFA regex that can do a better job than the start-over-on-each-option solution?

The guy who asked this question seems to know of a solution but doesn't list any sources.

(P.S. "Read this link"-type answers are welcome.)

BCS
  • Edit distance doesn't seem to be the right metric for auto-completion. – Karoly Horvath Mar 14 '14 at 23:35
  • Maybe not, but I suspect it's not a bad building block for when other filters turn up too few results (which is exactly what I'm considering using it for). – BCS Mar 15 '14 at 01:09
  • I guess the guy over there meant [something like this](http://stevehanov.ca/blog/index.php?id=114), but that's not really optimal. – Niklas B. Mar 15 '14 at 01:31
  • Basically the idea is that you compute the Levenshtein DP table row by row while walking down the trie (a trie node represents a prefix, which is in turn represented by the upper part of the table). – Niklas B. Mar 15 '14 at 01:35
  • So basically the runtime depends on how much work you can save by making use of common prefixes (to be more precise, it is O(n * m), where n is the length of the search string and m is the number of edges of the uncompressed trie that represents your dictionary). – Niklas B. Mar 15 '14 at 01:42
  • @BCS: you "suspect"? Just try it; at this point the speed doesn't matter, first figure out whether it's the right approach. I will tell you why I think it isn't: if you have a relatively short prefix, completely replacing it (substitution) is cheaper than extending it (insertion) to the right string. – Karoly Horvath Mar 15 '14 at 07:28
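A minimal sketch of the trie walk Niklas B. describes in the comments above, in the spirit of the linked blog post. All names here (`TrieNode`, `build_trie`, `search`) are illustrative, not from any library: each trie edge adds exactly one new DP row on top of the rows already computed for the shared prefix, and subtrees whose whole row exceeds the cost bound are pruned.

```python
# Sketch only: reuse Levenshtein DP rows across words that share a prefix.

class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode
        self.word = None     # set on nodes that terminate a dictionary word

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def search(root, query, max_cost):
    """Return (word, distance) pairs for dictionary words within max_cost of query."""
    first_row = list(range(len(query) + 1))  # distance from "" to each query prefix
    results = []
    for letter, child in root.children.items():
        _walk(child, letter, query, first_row, results, max_cost)
    return results

def _walk(node, letter, query, prev_row, results, max_cost):
    # One new DP row per trie edge; prev_row belongs to the parent prefix.
    row = [prev_row[0] + 1]
    for col in range(1, len(query) + 1):
        row.append(min(row[col - 1] + 1,                                 # insertion
                       prev_row[col] + 1,                                # deletion
                       prev_row[col - 1] + (query[col - 1] != letter)))  # match/substitution
    if node.word is not None and row[-1] <= max_cost:
        results.append((node.word, row[-1]))
    if min(row) <= max_cost:  # prune: no descendant can get back under the bound
        for next_letter, child in node.children.items():
            _walk(child, next_letter, query, row, results, max_cost)

# search(build_trie(["banana", "bandana", "band"]), "banan", 2)
# -> [('banana', 1), ('band', 2), ('bandana', 2)]
```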

1 Answer


I did something like this recently. Unfortunately it's closed source.

The solution is to write a Levenshtein automaton. Spoiler: it's an NFA.

Although many people will try to convince you that simulating NFAs is exponential, it isn't. Creating a DFA from an NFA is exponential; simulating one is just polynomial. Many regex engines are written with sub-optimal algorithms based on this misconception.

NFA simulation is O(n*m) for an n-character string and m states, or O(n) amortized if you convert it to a DFA lazily (and cache the result).
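Since the code behind this answer is closed source, here is a hedged sketch (my own illustration, not the answerer's implementation) of simulating a Levenshtein NFA directly. A state is a pair (i, e), meaning "matched the first i pattern characters using e edits"; keeping only the cheapest e per position i is what keeps the simulation polynomial.

```python
# Sketch only: simulate the Levenshtein NFA for `pattern` on `text`.
# Each step keeps at most O(len(pattern)) live states (cheapest e per i).

def lev_nfa_distance(pattern, text, max_edits):
    """Min edit distance between pattern and text if <= max_edits, else None."""
    m = len(pattern)

    def closure(states):
        # Epsilon moves: skipping a pattern character (a deletion) costs one edit.
        best = dict(states)
        stack = list(states.items())
        while stack:
            i, e = stack.pop()
            if i < m and e < max_edits and best.get(i + 1, max_edits + 1) > e + 1:
                best[i + 1] = e + 1
                stack.append((i + 1, e + 1))
        return best

    states = closure({0: 0})
    for c in text:
        nxt = {}
        for i, e in states.items():
            if e < max_edits:
                if nxt.get(i, max_edits + 1) > e + 1:                # insertion: consume c, stay
                    nxt[i] = e + 1
                if i < m and nxt.get(i + 1, max_edits + 1) > e + 1:  # substitution
                    nxt[i + 1] = e + 1
            if i < m and pattern[i] == c and nxt.get(i + 1, max_edits + 1) > e:
                nxt[i + 1] = e                                       # exact match, no edit
        states = closure(nxt)
        if not states:
            return None  # every branch exceeded max_edits
    return states.get(m)  # distance at the accept state, or None

# lev_nfa_distance("banana", "bandana", 2) -> 1
```

To rank auto-complete candidates as the question asks, you would run this (or, better, the lazily determinized DFA mentioned above) over each option and sort by the returned distance; candidates that return None fall outside the edit bound.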

I'm afraid you'll either have to deal with complex automata libraries or write a lot of code yourself (which is what I did).

Juan Lopes
  • Doesn't this just give a yes/no answer? How can you "rank a set of strings" according to their Levenshtein distance to another string using that? – Niklas B. Mar 15 '14 at 04:01
  • At the end of the simulation, you can get any metadata you want from the accept states. In the naïve NFA, each accept state will link to its original string (and the edit distance, encoded in the state). In the converted DFA, the single accept state will link to the pairs of original string and the edit distance for it at that state. – Juan Lopes Mar 15 '14 at 04:10
  • Based on the fact that the NFA is acyclic (ignoring self-edges) and you only care about the min solution, you can compute that with O(n) extra storage (for n = length of the reused strings)... but then I think I've just re-invented the standard solution. – BCS Mar 17 '14 at 18:55