
Say you have two strings of length 100,000 containing zeros and ones. You can compute their edit distance in roughly 10^10 operations.

If each string only has 100 ones and the rest are zeros then I can represent each string using 100 integers saying where the ones are.

Is there a much faster algorithm to compute the edit distance using this sparse representation? Even better would be an algorithm that also uses 100^2 space instead of 10^10 space.

To give something to test on, consider these two strings with 10 ones each. The integers say where the ones are in each string.

[9959, 10271, 12571, 21699, 29220, 39972, 70600, 72783, 81449, 83262]

[9958, 10270, 12570, 29221, 34480, 37952, 39973, 83263, 88129, 94336]
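
For anyone who wants to test against these lists, the dense strings can be rebuilt from the sparse representation. A minimal sketch (the helper name `dense` and the assumption n = 100000 are mine):

```python
def dense(ones, n=100000):
    """Expand a sorted list of one-positions into the full 0/1 string."""
    chars = ["0"] * n
    for i in ones:
        chars[i] = "1"
    return "".join(chars)
```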

In algorithmic terms: if we have two sparse binary strings of length n, each represented by the k indices of its ones, does there exist an O(k^2)-time edit distance algorithm?

Simd
  • While a bit more general, there is a lot of research into compressed alignment. Check out levenshtein/lcs on lz or rle-compressed strings. The latter would allow something like O(n' * m' + something smaller) where those vars are the number of runs (small in your case). E.g. `Ann, Hsing-Yen, et al. "A fast and simple algorithm for computing the longest common subsequence of run-length encoded strings." Information Processing Letters 108.6 (2008): 360-364` But don't underestimate `simple` there. There is a generalization to levenshtein in some other paper i think. – sascha Aug 03 '18 at 18:03
  • I seem to be naive. Isn't the edit distance for binary strings just their XOR? – Vroomfondel Aug 03 '18 at 19:07
  • @Vroomfondel No. Consider 01010101... and 10101010... – Simd Aug 03 '18 at 19:11
  • [Google Scholar: rle + edit + distance](https://scholar.google.de/scholar?hl=en&as_sdt=0%2C5&q=rle+edit+distance&btnG=). There are also LZ-variants maybe missing with those keywords. – sascha Aug 07 '18 at 20:13

1 Answer


Of course! With so many 0s, very few edits can ever matter. The answer is at most 200: in the worst case, delete all 100 ones of one string and insert the 100 ones of the other.

Take a look at

10001010000000001
vs       ||||||
10111010100000010

Look at all the zeroes with pipes. Does it matter which one out of those you end up deleting? Nope. That's the key.


Solution 1

Let's consider the normal n*m solution:

dp(int i, int j) {
    // memo & base case
    if( str1[i-1] == str2[j-1] ) {
        return dp(i-1, j-1);
    }
    return 1 + min( dp(i-1, j), dp(i-1, j-1), dp(i, j-1) );
}
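
For concreteness, here is that recursion as runnable Python (a sketch; note the comparison is between `str1[i-1]` and `str2[j-1]`, and the base cases are spelled out):

```python
from functools import lru_cache

def edit_distance(str1, str2):
    """Plain O(n*m) memoized edit-distance (Levenshtein) recursion."""
    @lru_cache(maxsize=None)
    def dp(i, j):
        if i == 0:
            return j          # insert the first j chars of str2
        if j == 0:
            return i          # delete the first i chars of str1
        if str1[i - 1] == str2[j - 1]:
            return dp(i - 1, j - 1)
        return 1 + min(dp(i - 1, j), dp(i - 1, j - 1), dp(i, j - 1))

    return dp(len(str1), len(str2))
```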

If almost every single character was a 0, what would hog the most amount of time?

if( str1[i-1] == str2[j-1] ) { // They will be equal so many times, (99900)^2 times!
    return dp(i-1, j-1);
}

I could imagine that trickling down for tens of thousands of recursions. All you actually need logic for are the ~200 critical points. You can ignore the rest. A simple modification would be

if( str1[i-1] == str2[j-1] ) {
    if( str1[i-1] == 1 )
        return dp(i-1, j-1); // Already hit a critical point

    // rightmost location of a 1 in str1 or str2 that is <= i-1
    best = binarySearch(CriticalPoints, i-1);
    return dp(best + 1, best + 1); // Use that critical point
    // Important! best+1 because we still want to compute the answer at best.
    // Without it, we would skip over a case where str1[best] is 1 and str2[best] is 0.
}

CriticalPoints would be the array containing the index of every 1 in either str1 or str2. Make sure it's sorted before you binary search. Keep those gotchas in mind. My logic was: I need to make sure to calculate the answer at the index best itself, so let's go with best + 1 as the parameter. But if best == i - 1, we get stuck in a loop; I handle that with the quick str1[i-1] == 1 check. Done, phew.
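
A runnable Python sketch of the same idea (function names are mine; the skip here is done symmetrically, jumping past the run of matching zeros using the rightmost 1 at or before the current position in each string, which avoids the i == j assumption baked into dp(best + 1, best + 1)):

```python
from bisect import bisect_left
from functools import lru_cache

def sparse_edit_distance(n, ones1, ones2):
    """Edit distance between two length-n 0/1 strings given the sorted
    positions of their ones; runs of matching zeros are skipped in one
    jump via binary search over the critical points."""
    s1, s2 = set(ones1), set(ones2)

    def prev_one(ones, pos):
        # rightmost index in `ones` that is <= pos, or -1 if none
        k = bisect_left(ones, pos + 1)
        return ones[k - 1] if k else -1

    @lru_cache(maxsize=None)
    def dp(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        a = 1 if (i - 1) in s1 else 0
        b = 1 if (j - 1) in s2 else 0
        if a == b:
            if a == 1:
                return dp(i - 1, j - 1)  # critical point: step normally
            # both chars are '0': skip the whole run of matching zeros
            t = min(i - 1 - prev_one(ones1, i - 1),
                    j - 1 - prev_one(ones2, j - 1))
            return dp(i - t, j - t)
        return 1 + min(dp(i - 1, j), dp(i - 1, j - 1), dp(i, j - 1))

    return dp(n, n)
```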

You can sanity-check the running time as follows: in the worst case you hit all 200*100000 combinations of i and j that involve a critical point, and each such call to min(a, b, c) makes only three recursive calls. If a callee is itself a critical point, it is one of the 200*100000 states we already counted and we can ignore it. If it is not, then in O(log(200)) time it collapses into a single call on another critical point, which again is something we already counted. So each critical point costs at worst 3*log(200), excluding calls to other critical points, and the very first call falls into a critical point in log(200) time. That gives an upper bound of O(200*100000*3*log(200) + log(200)).

Also, make sure your memo table is a hashmap, not an array: a 10^10-entry array will not fit in your computer's memory.


Solution 2

You know the answer is at most 200, so just prevent yourself from computing more than that many operations deep.

dp(int i, int j) { // O(100000 * 205), sounds good to me.
    if( abs(i - j) > 205 )
        return 205; // The answer here is at least 205, so when min is called it won't be the smallest; it's irrelevant to the final answer.
    // memo & base case
    if( str1[i-1] == str2[j-1] ) {
        return dp(i-1, j-1);
    }
    return 1 + min( dp(i-1, j), dp(i-1, j-1), dp(i, j-1) );
}

This one is very simple, but I left it as Solution 2 because it seems to come out of thin air, as opposed to analyzing the problem, finding the slow part, and optimizing it. Keep it in your toolbox, though, since this is the solution you should actually code.
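
Here is Solution 2 as runnable Python (a sketch; the function name is mine, the bound is a parameter defaulting to 205, and the comparison uses `str2`):

```python
from functools import lru_cache

def banded_edit_distance(str1, str2, bound=205):
    """Solution 2 sketch: abandon any state whose |i - j| already
    exceeds a known upper bound on the answer."""
    @lru_cache(maxsize=None)
    def dp(i, j):
        if abs(i - j) > bound:
            return bound  # means "at least `bound`"; min() will ignore it
        if i == 0:
            return j
        if j == 0:
            return i
        if str1[i - 1] == str2[j - 1]:
            return dp(i - 1, j - 1)
        return 1 + min(dp(i - 1, j), dp(i - 1, j - 1), dp(i, j - 1))

    return dp(len(str1), len(str2))
```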


Solution 3

Just like Solution 2, we could have done it like this:

dp(int i, int j, int threshold = 205) {
    if( threshold == 0 )
        return 205;
    // memo & base case
    if( str1[i-1] == str2[j-1] ) {
        return dp(i-1, j-1, threshold);
    }
    return 1 + min( dp(i-1, j, threshold - 1), dp(i-1, j-1, threshold - 1), dp(i, j-1, threshold - 1) );
}

You might be worried about dp(i-1, j-1) trickling down, but the threshold keeps i and j close together so it calculates a subset of Solution 2. This is because the threshold gets decremented every time i and j get farther apart. dp(i-1, j-1, threshold) would make it identical to Solution 2 (Thus, this one is slightly faster).
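
The same variant as runnable Python (a sketch; memoising on (i, j, t) is wasteful but keeps the code a direct transcription):

```python
from functools import lru_cache

def threshold_edit_distance(str1, str2, bound=205):
    """Solution 3 sketch: pass the remaining budget down the recursion
    instead of checking |i - j|."""
    @lru_cache(maxsize=None)
    def dp(i, j, t):
        if t == 0:
            return bound  # budget exhausted: report "at least `bound`"
        if i == 0:
            return j
        if j == 0:
            return i
        if str1[i - 1] == str2[j - 1]:
            return dp(i - 1, j - 1, t)  # a match costs no budget
        return 1 + min(dp(i - 1, j, t - 1),
                       dp(i - 1, j - 1, t - 1),
                       dp(i, j - 1, t - 1))

    return dp(len(str1), len(str2), bound)
```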


Space

These solutions will give you the answer very quickly, but if you also want to optimize space, it's easy to replace str1[i] with (i in Str1CriticalPoints) ? 1 : 0 using a hash set. The final solution is still very fast (though about 10x slower) and avoids keeping the long strings in memory at all (to the point where it could run on an Arduino). I don't think this is necessary, though.
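
Putting the band and the set-membership trick together, a hedged sketch (function name mine) that never materializes the strings and keeps only two O(bound)-wide rows:

```python
def sparse_banded_edit_distance(n, ones1, ones2, bound=205):
    """Iterative banded DP over the sparse representation: only cells
    with |i - j| <= bound are kept per row, and characters are looked
    up in sets rather than in materialized strings. Exact whenever
    `bound` is at least the true edit distance."""
    s1, s2 = set(ones1), set(ones2)
    INF = bound + 1
    prev = {j: j for j in range(0, min(n, bound) + 1)}  # row i = 0
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - bound), min(n, i + bound) + 1):
            if j == 0:
                cur[j] = i
                continue
            same = ((i - 1) in s1) == ((j - 1) in s2)
            cur[j] = min(prev.get(j - 1, INF) + (0 if same else 1),
                         prev.get(j, INF) + 1,      # delete from str1
                         cur.get(j - 1, INF) + 1)   # insert from str2
        prev = cur
    return prev[n]
```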

Note that the original solution does not use 10^10 space. You mention wanting 100^2 space rather than 10^10, with the implication that 10^10 space might be acceptable. Unfortunately, even with enough RAM, merely iterating through that space takes 10^10 time, which is definitely not acceptable. None of my solutions use 10^10 space: only 2 * 10^5 to hold the strings, which can be avoided as discussed above. (10^10 bytes is 10 GB, for reference.)


EDIT: As maniek notes, you only need to check abs(i - j) > 105, as the remaining 100 insertions needed to equate i and j will pull the number of operations above 200.

Nicholas Pipitone
  • how can you find rightmost or leftmost "one" in unsorted substring by means of binary search? – mangusta Aug 03 '18 at 18:07
  • @mangusta Well, sort it first. LocationOfOnesInStrX would be the indices of the ones. You don't even have to sort it to make it efficient, looping is fine it's already efficient enough. Sorting makes sense though so you can easily binary search. – Nicholas Pipitone Aug 03 '18 at 18:10
  • okay, you added an edit about sorting, how is it possible to find edit distance of two strings if you initially modify them by sorting? – mangusta Aug 03 '18 at 18:11
  • "You don't even have to sort it to make it efficient, looping is fine it's already efficient enough", in that case what we have is the canonical version of edit distance – mangusta Aug 03 '18 at 18:14
  • It's not the string, its the indices. LocationOfOnesInStrX would be something like [10, 15, 28], if there is a 1 at location 10, 15, 28, and zeroes everywhere else. We would be sorting that array. str1 and str2 are the actual strings themselves, which we don't touch. Of course, we could replace all str1[i] with (i in LocationOfOnesInStrX) to reduce memory usage. – Nicholas Pipitone Aug 03 '18 at 18:14
  • @mangusta We wouldn't have the canonical version. The canonical version would call `dp(i-1, j-1)` another 10,000 times in a sequence of 10000 zeroes. If you just loop over LocationOfOnesInStrX, then you could find it in just 100 iterations. – Nicholas Pipitone Aug 03 '18 at 18:22
  • I got your point. Solution 1 makes sense and it utilizes the positions of "ones" while Solutions 2 and 3 would be useful only in case of all 200 difference points coming consecutively along the strings because that's the only case when the difference between i and j could be max. In all other cases they would perform like a usual edit distance – mangusta Aug 03 '18 at 19:21
  • I am sorry I can't quite follow the full claim you are making sorry. Is your algorithm O(k^2 + n) time and space for two strings of length n with k ones and the rest zeros? If so, that's pretty amazing. – Simd Aug 05 '18 at 10:53
  • @Anush AFAICT, they're all variants of the O(nd)-time algorithm for length-n strings of edit distance O(d). – David Eisenstat Aug 12 '18 at 14:54
  • @DavidEisenstat Thanks. I wonder if there is a better algorithm. – Simd Aug 12 '18 at 14:56
  • @Anush I lied. That doesn't handle substitutions. Maybe you don't care (substitutions have cost 2 instead of 1). – David Eisenstat Aug 12 '18 at 15:15
  • `abs(i - j) > 205` -- you can check for 105. If the difference is more than 100, then you know you will eventually have to do at least 100 inserts, which will bring the result over 200. – maniek Aug 14 '18 at 17:30
  • @Anush You can see O(100000 * 205) in Sol 2 and 3, which would be O(n * k) – Nicholas Pipitone Sep 27 '18 at 01:17