
I have a string S of length 1000 and a query string Q of length 100. I want to calculate the edit distance between the query string Q and every substring of S of length 100. One naive way to do this is to compute the edit distance of every substring independently, i.e. edDist(q,s[0:100]), edDist(q,s[1:101]), edDist(q,s[2:102]), ..., edDist(q,s[900:1000]).

from numpy import zeros

def edDist(x, y):
    """ Calculate edit distance between sequences x and y using
        matrix dynamic programming.  Return distance. """
    D = zeros((len(x)+1, len(y)+1), dtype=int)
    D[0, 1:] = range(1, len(y)+1)
    D[1:, 0] = range(1, len(x)+1)
    for i in range(1, len(x)+1):
        for j in range(1, len(y)+1):
            delt = 1 if x[i-1] != y[j-1] else 0
            D[i, j] = min(D[i-1, j-1]+delt, D[i-1, j]+1, D[i, j-1]+1)
    return D[len(x), len(y)]
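
For context, the naive scan I describe above would look something like this (just a sketch, reusing the edDist function from above; naive_all_windows is only an illustrative name and 100 is the window length):

# Naive approach: run the full DP independently for every length-100 window of s.
# Each call costs roughly O(100 * 100), and there are 901 windows in a string of length 1000.
def naive_all_windows(q, s, w=100):
    return [edDist(q, s[i:i+w]) for i in range(len(s) - w + 1)]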

Can somebody suggest an alternative approach to calculate the edit distance more efficiently? My take on this is that we already know edDist(q,s[900:1000]). Can we somehow use this knowledge to calculate edDist(q,s[899:999]), since the two substrings differ by only one character, and then proceed backwards to edDist(q,s[0:100]) using the previously calculated edit distances?


1 Answer


Improving Space Complexity

One way to make your Levenshtein distance algorithm more efficient is to reduce the amount of memory required for your calculation.

Using the entire matrix requires O(n * m) memory, where n is the length of the first string and m the length of the second.

If you think about it, the only parts of the matrix we really care about are the last two columns that we're checking - the previous column and the current column.

Knowing this, we can pretend we have a full matrix but only ever create these two columns, writing over their data whenever we need to update them.

All we need here is two arrays of size n + 1:

var column_crawler_0 = new Array(n + 1);
var column_crawler_1 = new Array(n + 1);

Initialize the values of these pseudo columns:

for (let i = 0; i < n + 1; ++i) {
  column_crawler_0[i] = i;
  column_crawler_1[i] = 0;
}

Then go through your normal algorithm, making sure to update these arrays with the new values as you go along:

for (let j = 1; j < m + 1; ++j) {
  column_crawler_1[0] = j;
  for (let i = 1; i < n + 1; ++i) {
    // Perform normal Levenshtein calculation method, updating current column
    let cost = a[i-1] === b[j-1] ? 0 : 1;
    column_crawler_1[i] = Math.min(column_crawler_1[i - 1] + 1, column_crawler_0[i] + 1, column_crawler_0[i - 1] + cost);
  }

  // Copy current column into previous before we move on
  column_crawler_1.forEach((e, i) => {
    column_crawler_0[i] = e;
  });
}

return column_crawler_1.pop()
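
If you'd rather see the same idea in Python to match the code in your question, here is a rough sketch of a two-row variant (rows instead of columns, and swapping the two lists instead of copying; edDistTwoRows is only an illustrative name):

def edDistTwoRows(x, y):
    # prev holds the previous row of the DP matrix, curr the row being filled in.
    prev = list(range(len(y) + 1))
    curr = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        curr[0] = i
        for j in range(1, len(y) + 1):
            delt = 0 if x[i-1] == y[j-1] else 1
            curr[j] = min(curr[j-1] + 1, prev[j] + 1, prev[j-1] + delt)
        # Swap the two rows instead of copying element by element.
        prev, curr = curr, prev
    return prev[len(y)]

This keeps only O(len(y)) extra memory, which is the same trick as the two column crawlers above; swapping references is also a little cheaper than copying every entry at the end of each pass.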

If you want to analyze this approach further, I wrote a small open-source library using this specific technique, so feel free to check it out if you're curious.

Improving Time Complexity

There's no straightforward way to make a Levenshtein distance computation run faster than O(n^2). There are a few more involved approaches, one of which uses VP-Tree data structures. There are a few good sources if you're curious to read about them here and here, and these approaches can reach an asymptotic running time of O(n lg n).
