0

I'm looking for the most efficient way of solving an Levenshtein edit distance problem.

We are given as input:

  1. A set of strings S of size n <= 8, with average length m <= 50
  2. A target string t of length l <= 50

Our task is to 'align' t with S i.e:

Find the subset s* of S, where concat(s*) has the minimum Levenshtein edit distance (among all the subsets) with t

Some thoughts:

  1. For 2 strings of length l1 and l2, we could use the standard dynamic programming algorithm which has O(l1*l2) time complexity
  2. Brute forcing this would require us to compute editDistance(t, concat(s')) for each subset s'. This would be approximately O(l.m.n!)
  3. This could be optimized a bit by memoizing the results i.e if we are computing editDistance(t, S[1, 2, 3, 4]) we could re-use the computation from S[1, 2, 3])
  4. The other option I could think of is to construct a Trie or DAWG (Directed Acyclic Word Graph)

But, I'm not an expert on this, so I might be missing a better solution. How would we do this efficiently?

Thanks in advance!

Mathguy
  • 157
  • 11
  • Optimization idea: length difference >= Levenshtein - distance (could be a shortcut in some cases) – MrSmith42 Aug 02 '23 at 08:11
  • `concat(s*)` requires the elements of `s*` to be ordered in some way. Is there a particular order (i.e., you really mean "sequence" and "subsequence" instead of "set" and "subset"), or is it required to find the best ordering? – Matt Timmermans Aug 02 '23 at 12:06

1 Answers1

0

Outline of a memoize solution. (I just didn't do levenshtein for you.)

def best_answer(S, t, p0=0, p1=None, cache=None):
    # Defaults
    if cache is None:
        cache = {}
    if p1 is None:
        p1 = len(t)
    if ('s', p0, p1) not in cache:
        best_strings = []
        best_error = p1 - p0
        for s in S:
            leven = levenshtein(s, t[p0:p1])
            if leven < best_error:
                best_strings = [s]
                best_error = leven
        for p in range(p0+1, p1):
            (best_strings0, best_error0) = best_answer(S, t, p0, p, cache)
            (best_strings1, best_error1) = best_answer(S, t, p, p1, cache)
            if best_error0 + best_error1 < best_error:
                best_strings = best_strings0 + best_strings1
                best_error = best_error0 + best_error1
        cache[(p0, p1)] = (best_strings, best_error)
    return cache[('s', p0, p1)]
btilly
  • 43,296
  • 3
  • 59
  • 88