
I just wrote an answer to a question:

longest common subsequence: why is this wrong?

This function is supposed to find the longest common substring of two strings, but when I tried to figure out the worst-case runtime and the input that would cause it, I realized I didn't know. Consider the code to be C pseudocode.

#include <string.h>  // for strlen

// assume the shorter string is passed in as A
int lcs(char * A, char * B)
{
  int length_a = strlen(A);
  int length_b = strlen(B);

  // This holds the length of the longest common substring found so far
  int longest_length_found = 0;

  // for each character in one string (doesn't matter which), look for 
  //   incrementally larger strings in the other
  // once a longer substring can no longer be found, stop
  for (int a_index = 0; a_index < length_a - longest_length_found; a_index++) {
    for (int b_index = 0; b_index < length_b - longest_length_found; b_index++) {

      // check the next letter until a mismatch is found or one of the strings ends.
      for (int offset = 0; 
           A[a_index+offset] != '\0' && 
             B[b_index+offset] != '\0' && 
             A[a_index+offset] == B[b_index+offset]; 
           offset++) {          
        // offset+1 characters have matched at this point; record the new maximum
        longest_length_found = longest_length_found > offset + 1 ? longest_length_found : offset + 1;
      }
    }
  }
  return longest_length_found;
}

Here's my thinking so far:

Below, I'll assume A and B are roughly the same size so that I don't have to write A*B*A everywhere; I'll just say n^3. If that's a terribly bad assumption, I can update the question.

Without some of the optimizations in the code, I believe the runtime is A*B*A, for an n^3 runtime.

However, if the strings are dissimilar and a long substring is never found, the inner-most for loop drops to a constant, leaving us with A*B, right?

If the strings are exactly the same, the algorithm takes linear time, as there is only one simultaneous pass through each of the strings.

If the strings are similar, but not identical, then longest_length_found becomes a significant fraction of the length of the smaller of A and B, which divides one of the factors out of the n^3, leaving us with n^2, right? I'm just trying to understand what happens when they are remarkably similar but not identical.

Thinking out loud: what if, on the first letter, you find a substring with a length around half the length of A? That means you would run A/2 iterations of the first loop, B-(A/2) iterations of the second loop, and then up to A/2 iterations of the third loop (assuming the strings were very similar) without finding a longer substring. Assuming roughly even length strings, that's n/2 * n/2 * n/2 = O(n^3).

Sample strings which could show this behavior:

A A A B A A A B A A A B A A A B

A A A A B A A A A B A A A A B A

Am I close or am I missing something or misapplying something?

I'm pretty sure I could do better using a trie/prefix tree, but again, I'm just really interested in understanding the behavior of this specific code.
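
Here's a rough throwaway driver (just a sketch, not a definitive test; it assumes the lcs() above compiles as written) that builds longer versions of the sample strings above and times lcs() so the growth can be eyeballed as n increases:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int lcs(char * A, char * B);  // the function above

// repeat `pattern` until the buffer holds n characters plus a terminator
static char * build(const char * pattern, int n)
{
  char * s = malloc(n + 1);
  int plen = strlen(pattern);
  for (int i = 0; i < n; i++)
    s[i] = pattern[i % plen];
  s[n] = '\0';
  return s;
}

int main(void)
{
  for (int n = 1000; n <= 8000; n *= 2) {
    char * A = build("AAAB", n);   // the first sample string above
    char * B = build("AAAAB", n);  // the second sample string above
    clock_t t0 = clock();
    int len = lcs(A, B);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("n=%5d  longest=%4d  %.3f s\n", n, len, secs);
    free(A);
    free(B);
  }
  return 0;
}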

xaxxon
    Usually when we talk about runtime complexity we talk about a few cases such as "average case", "worst case", or "expected runtime" (in the case of randomized algorithms). Here it seems like you have partitioned the possible inputs into arbitrary sets and said "this runs in `O(n^3)` while this runs in `O(n^2)`". I'm just suggesting that maybe it's worth looking into the definitions again because the question as is, is somewhat ambiguously defined. Also I would claim that it's not linear if the strings are equal... but it depends on your cost model. – rliu May 20 '13 at 02:29
  • I think the answer is that unless I can get the optimizations to the square root of the length of a string, the runtime doesn't get any better -- and the optimizations just create fewer worst cases. – xaxxon May 20 '13 at 02:34
  • Yeah, I'm trying to understand what the worst case is - and what the worst case data set is. I updated the question to be more specific – xaxxon May 20 '13 at 02:34
  • I'm pretty sure it's linear if the strings are equal. The first time through when a_index and b_index both are 0, the inner-most for loop will have offset go from 0 to length of the strings doing constant work per iteration. At which point longest_length_found will be set to the length of the string, causing both the second for loop to terminate and then the outer for loop to terminate as well since 1 !< 0 (string length - itself) – xaxxon May 20 '13 at 02:39
  • I didn't notice you subtracted the longest length from the upper bound of the loop. I don't think that's correct... what if the longest common subsequence in `B` (meaning the copy in `B`) ends with the last character in `B`? Basically, `A = asbb` and `B = axbb` – rliu May 20 '13 at 02:43
  • Pretty sure this is fine. The subtraction is off the starting point. If you've found a substring of length 5, there's no reason to consider strings that have 5 or fewer characters in them. There's nothing stopping you from looking all the way to the end of the string, only the start point is limited. The innermost loop can go all the way to the end of whichever string runs out of characters first. – xaxxon May 20 '13 at 02:49
  • Ah you're right. I didn't really look at the pseudocode too closely to be honest. Regardless, in the worst case the strings aren't similar at all and you get `O(n^3)` like... I think you said above. Are you trying to figure out if there is some set of "similar `A` and `B`" that still have a bad runtime? – rliu May 20 '13 at 02:59
  • if they aren't similar at all, I think the innermost for loop drops to a constant, so you get A*B. – xaxxon May 20 '13 at 03:01
  • Normally when I do runtime determination, I look at the two extremes and see that one is really bad. In this case, very similar or very dissimilar strings. However, they both have good runtimes and I'm trying to figure out if there is a bad one hidden in the middle somewhere. – xaxxon May 20 '13 at 03:06
  • It's honestly really hard to analyze this code because it's a normal brute force + optimizations. I think you might run into problems if the first `n/4` characters match (any constant fraction of `n`), but then subsequently the longest match found is always `n/4-1` or something like that. But again, it's hard to reason about code that is finely tuned. The reason why only thinking about the extreme cases is a bit invalid (I think at least) is because you optimized your algorithm to make the extreme cases work well. – rliu May 20 '13 at 03:10
  • yeah, that's fair. I guess what I'm trying to figure out is if any of the optimizations are significant enough to reduce the worst case runtime. I think the answer is no. – xaxxon May 20 '13 at 03:12
  • I think I finally got a counterexample. I'm still trying to rigorize the math but basically let `A = aaaa...aabb...bbbb` (it's half `a`s then half `b`s). Then let `B` just be the same length but just `a`s. I _believe_ your algorithm will produce a `O(n^3)` runtime for this input. Loosely it's something like `n/2*(n/2) + n/2*(n/2-1) + n/2*(n/2-2) + ... + n/2(2) + n/2(1) => n/2*O(n^2) = O(n^3)`. You can try working the math out by hand in parallel if you want – rliu May 20 '13 at 03:51
  • Do me a favor and put that in an answer, please – xaxxon May 20 '13 at 03:53

2 Answers


I think what rliu said in the comments is bang on the money: your algorithm is O(N^3), with a best case of O(N^2).

What I actually wanted to point out is how much redundant work this algorithm does. You see, for every possible pair of starting offsets in the two strings, you test every subsequent matching character to count the length of the match. But consider something like this:

A = "01111111"
B = "11111110"

Almost the first thing you will find is the maximum matching substring starting at A[1] and B[0], and then later on you will test parts of that exact overlap, beginning at A[2], B[1] and so on... What's important here is the relative offset. You can completely drop the N^3 part of the algorithm by realising this. Then it becomes a matter of shifting one of the arrays beneath the other.

A         01111111
B  11111110
B   11111110
B    11111110
B        ... -->
B                11111110

To make the code less complicated, you can test just half of the system, then swap the arrays and test the other half:

// Shift B under A
A  01111111
B  11111110
B      ... -->
B         11111110

// Shift A under B
B  11111110
A  01111111
A      ... -->
A         01111111

If you do this, then you have something like O((A+B-2) * min(A,B) / 2), or more conveniently, O(N^2):

int lcs_half(char * A, char * B)
{
    int maxlen = 0, len = 0;
    int offset, i;
    // slide B to the left under A, one character at a time
    for( offset = 0; B[offset]; offset++ )
    {
        len = 0;
        // walk the overlapping region, counting runs of matching characters
        for( i = 0; A[i] && B[i+offset]; i++ )
        {
            if( A[i] == B[i+offset] ) {
                len++;
                if( len > maxlen ) maxlen = len;
            }
            else len = 0;   // a mismatch ends the current run
        }
    }
    return maxlen;
}

int lcs(char * A, char * B)
{
    // test both shift directions and take the longer match
    int run1 = lcs_half(A,B);
    int run2 = lcs_half(B,A);
    return run1 > run2 ? run1 : run2;
}
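
For example, a quick sanity check using the strings from above (this little driver is just for illustration):

#include <stdio.h>

int main(void)
{
    char A[] = "01111111";
    char B[] = "11111110";
    // the longest common substring is "1111111", so this prints 7
    printf("%d\n", lcs(A, B));
    return 0;
}
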
paddy
  • My algorithm has a clear best case of O(N) for identical strings. "Almost the first thing you will find is the maximum matching substring starting at A[1] and B[0], and then later on you will test parts of that exact overlap, beginning at A[2], B[1]" -- that's not true. It will immediately stop after finding a substring of length 7, as there are no more possible ways to get a substring of 8 or greater. – xaxxon May 20 '13 at 03:34
  • Also, there are times when you do have to check things you already have. For instance A="ABABCD" B="ABABABCD". You find a match of ABAB but you can't just completely skip ahead, you do have to check "BABCD" as well, and then you find that you do have a match using some of the parts already tested. – xaxxon May 20 '13 at 03:36
  • I think the algorithm I presented is too intricate (not hard, but time consuming to ingest) to get a proper answer on stack overflow. And that's fine. Still thinking about your shift solution. – xaxxon May 20 '13 at 03:38
  • That wasn't the point, and maybe my example was bad because I made the strings only contain the interesting part. I mean that if you have a longer string with these patterns inside, you will be repeatedly counting up smaller-length matches from an overlapping section that you've already examined earlier. – paddy May 20 '13 at 03:39
  • Regarding your second comment, that is all handled by my alternative solution. Note that the inner loop can detect any number of matching sequences at the current shift offset. I haven't done the end-of-string optimization that you did, because I don't even calculate string lengths, but it wouldn't be much of a pain to do it. – paddy May 20 '13 at 03:41
  • By the way, I'm not really trying to provide this as an answer, but it just wasn't practical to write it in the comments =) Just putting it here because the question was interesting. – paddy May 20 '13 at 03:42

So, after we talked about it in the comments, we agreed that the question is really about finding the worst-case runtime of the code. We can claim it's at least Omega(n^3) with the following proof:

Let
A = aaaa...aabb...bbbb meaning that |A| = n and it's composed of n/2 a's and n/2 b's.
B = aaaa.... where |B| = n.

Now consider the first n/2 iterations of the outer-most loop (i.e. the first n/2 starting indices into the A string). Fix some iteration i of those first n/2 iterations. The upper bound of the second loop is at least n - n/2 = n/2, because longest_length_found can never exceed the length of the longest common substring of the two strings, which is n/2. For each iteration of the second loop we match a string of length n/2 - i (you can prove this by contradiction). So we have that, after the first n/2 iterations of the outer-most loop, the line:

longest_length_found = longest_length_found > offset + 1 ? longest_length_found : offset + 1;

has run:

n/2*(n/2) + n/2*(n/2-1) + n/2*(n/2-2) + ... + n/2*(2) + n/2*(1) = n/2*Omega(n^2) = Omega(n^3)

Specifically, for the first iteration of the outer-most loop we have a string of n/2 a's in string A and there are n/2 starting spots in B. For each starting spot in B we'll match a full common substring of length n/2 (meaning we'll hit that line n/2 times). So that's n/2*(n/2). For the next iteration of the outer-most loop we have a string of n/2-1 a's in string A and there are still n/2 starting spots in B. In this case we match a common substring of length n/2-1 for each starting index => n/2*(n/2-1). The same argument works inductively up to i = n/2.

Anyways, we know that the running time of the algorithm on this input is at least the running time of the first n/2 iterations of the outer-most loop, so it's also Omega(n^3).
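
If you want to check this empirically, here's a rough sketch (mine, just for illustration, not part of the proof) that inlines the question's three loops with a counter added and counts how often that line runs on the input above:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// count how many times the innermost max-update line executes
static long long count_inner_hits(const char * A, const char * B)
{
  int length_a = strlen(A);
  int length_b = strlen(B);
  int longest_length_found = 0;
  long long hits = 0;

  for (int a_index = 0; a_index < length_a - longest_length_found; a_index++)
    for (int b_index = 0; b_index < length_b - longest_length_found; b_index++)
      for (int offset = 0;
           A[a_index+offset] != '\0' &&
             B[b_index+offset] != '\0' &&
             A[a_index+offset] == B[b_index+offset];
           offset++) {
        hits++;  // this is the line the proof counts
        if (offset + 1 > longest_length_found)
          longest_length_found = offset + 1;
      }
  return hits;
}

int main(void)
{
  // hits should grow roughly 8x each time n doubles, i.e. cubically
  for (int n = 200; n <= 1600; n *= 2) {
    char * A = malloc(n + 1);
    char * B = malloc(n + 1);
    memset(A, 'a', n/2);              // A = n/2 a's followed by n/2 b's
    memset(A + n/2, 'b', n - n/2);
    A[n] = '\0';
    memset(B, 'a', n);                // B = n a's
    B[n] = '\0';
    printf("n=%5d  hits=%12lld\n", n, count_inner_hits(A, B));
    free(A);
    free(B);
  }
  return 0;
}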

rliu
  • No problem. I had to make up for the fact that I posted several garbage comments critiquing your code without actually reading the code... I was feeling lazy. – rliu May 20 '13 at 16:43