Finding similar strings in large datasets

Question

I'm using Levenshtein distance to retrieve similar strings from a list. At the moment the list has just a few thousand items, but we'll need to support at least 100k items.

I'm trying to make this more efficient, and one technique I came up with was to calculate the Levenshtein distance only on strings of similar length. I also thought about filtering on the initial character, i.e. if the search string starts with b, I'd run the calculation only on the strings that start with b. But I'm not sure I can assume this will work all the time.

I was wondering if you all have a better way of getting this done?

Thanks

What is the maximum Levenshtein distance you are still interested in? — Falk Hüffner, Aug 11 '14 at 09:40

score 0 · Answer 1 · answered Aug 10 '14 at 13:00

One way to go would be to hope that a match with small edit distance would have within it a short exact match. If you assume this, then, given the string ABCDEF, retrieve all strings containing ABC, BCD, CDE, or DEF, and compute their edit distances. You may even find that the best match among these is so close that any closer match must have a short match inside it, so you would have found it already. You would have to accept that if you are unlucky you may miss some good matches, or be forced to go through all the possibilities one by one.
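A minimal sketch of this idea in Python, assuming substrings of length 3 and an in-memory dictionary as the substring "database". The names `ngrams`, `build_index`, `search`, and the `max_dist` cutoff are illustrative choices of mine, not part of any library:

```python
from collections import defaultdict


def ngrams(s, n=3):
    """Return the set of all length-n substrings of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def levenshtein(a, b):
    """Dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def build_index(strings, n=3):
    """Map each length-n substring to the indices of strings containing it."""
    index = defaultdict(set)
    for idx, s in enumerate(strings):
        for g in ngrams(s, n):
            index[g].add(idx)
    return index


def search(query, strings, index, n=3, max_dist=2):
    """Score only the strings that share at least one n-gram with query."""
    candidates = set()
    for g in ngrams(query, n):
        candidates |= index.get(g, set())
    scored = [(levenshtein(query, strings[i]), strings[i]) for i in candidates]
    return sorted((d, s) for d, s in scored if d <= max_dist)
```

With this sketch, searching for ABCDEF finds ABCXEF (it shares ABC) but would not find AXCDXF, which is at edit distance 2 yet shares no 3-character substring with the query. That is the "unlucky" case described above.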

As an alternative to building a database of substrings, you could build a suffix array (http://en.wikipedia.org/wiki/Suffix_array) and LCP array from a string obtained by concatenating all the stored strings, separating them with a marker character not otherwise used. This takes time and space linear in the input size. You would then search for exact matches by looking for suffixes in the suffix array that start with ABCDEF, BCDEF, CDEF, and DEF.
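A rough sketch of the suffix-array lookup, under some simplifying assumptions: the array is built by naively sorting suffixes (so it is not the linear-time construction, and it is only practical for small inputs), no LCP array is used, and the separator `\x00` is assumed never to appear in the stored strings or in the pattern. The function names are my own:

```python
from bisect import bisect_right

SEP = "\x00"  # assumed not to occur in any stored string or pattern


def build_suffix_array(strings):
    """Concatenate the strings with SEP and sort all suffix start positions."""
    text = SEP.join(strings) + SEP
    # Start offset of each stored string, used to map hits back to a string.
    starts = []
    pos = 0
    for s in strings:
        starts.append(pos)
        pos += len(s) + 1
    # Naive construction by comparing suffix slices; not linear time.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, starts


def find_containing(pattern, text, sa, starts):
    """Return the indices of stored strings that contain pattern."""
    m = len(pattern)
    # Binary search for the first suffix whose m-character prefix >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes starting with pattern are contiguous from here.
    hits = set()
    while lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        hits.add(bisect_right(starts, sa[lo]) - 1)
        lo += 1
    return hits
```

Each lookup returns candidate strings that can then be scored with the edit distance. Because the separator cannot occur in a pattern, a match never spans two stored strings.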
