I just wrote an answer to a question:
longest common subsequence: why is this wrong?
This function is supposed to find the longest common substring of two strings, but when I tried to work out its worst-case runtime and the input that would trigger it, I realized I didn't know. Consider the code to be C pseudocode.
#include <string.h>

// assume the shorter string is passed in as A
int lcs(const char *A, const char *B)
{
    int length_a = (int)strlen(A);
    int length_b = (int)strlen(B);

    // This holds the length of the longest common substring found so far
    int longest_length_found = 0;

    // for each character in one string (doesn't matter which), look for
    // incrementally larger strings in the other
    // once a longer substring can no longer be found, stop
    for (int a_index = 0; a_index < length_a - longest_length_found; a_index++) {
        for (int b_index = 0; b_index < length_b - longest_length_found; b_index++) {
            // check the next letter until a mismatch is found or one of the strings ends.
            for (int offset = 0;
                 A[a_index+offset] != '\0' &&
                 B[b_index+offset] != '\0' &&
                 A[a_index+offset] == B[b_index+offset];
                 offset++) {
                // offset is zero-based, so the match so far spans offset + 1 characters
                int match_length = offset + 1;
                longest_length_found = longest_length_found > match_length ? longest_length_found : match_length;
            }
        }
    }

    return longest_length_found;
}
Here's my thinking so far:
Below, I'll assume A and B are roughly the same size so that, instead of writing A*B*A, I can just say N^3. If that simplification is a problem, I can update the question.
Without some of the optimizations in the code (the loop bounds that shrink as longest_length_found grows), I believe the runtime is A*B*A, i.e. N^3.
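As a rough check of that estimate, the work of the version without shrinking loop bounds can be bounded directly. This is only a sketch: |A| and |B| stand for the two string lengths, and T is a symbol introduced here for the number of times the innermost loop condition is evaluated.

```latex
% For start positions (a, b) the innermost loop can match at most
% min(|A|-a, |B|-b) characters, plus one final failing check.
T \;\le\; \sum_{a=0}^{|A|-1} \sum_{b=0}^{|B|-1} \bigl(\min(|A|-a,\; |B|-b) + 1\bigr)
  \;\le\; |A| \cdot |B| \cdot \bigl(\min(|A|, |B|) + 1\bigr)
  \;=\; O\bigl(|A| \cdot |B| \cdot \min(|A|, |B|)\bigr)
% With |A| <= |B| this is O(|A|^2 |B|); with |A| ~ |B| ~ N it is O(N^3).
```

This is an upper bound only; it does not by itself show that any input actually reaches it.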
However, if the strings are dissimilar and a long substring is never found, the innermost for loop would exit after a constant number of iterations, leaving us with A*B, right?
If the strings are exactly the same, the algorithm takes linear time: a single simultaneous pass through both strings finds the full-length match, after which both outer loops terminate.
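These two extreme cases can be checked by counting how often the innermost loop condition is evaluated. The following is only a sketch: count_inner_checks is a hypothetical helper, not part of the original function. It copies the same loop structure and assumes the match length is recorded as offset + 1.

```c
#include <string.h>

/* Hypothetical helper: same loop structure as lcs(), but it returns the
   number of times the innermost loop condition is evaluated instead of
   the length of the longest common substring. */
long count_inner_checks(const char *A, const char *B)
{
    int length_a = (int)strlen(A);
    int length_b = (int)strlen(B);
    int longest = 0;   /* longest common substring length found so far */
    long checks = 0;

    for (int a = 0; a < length_a - longest; a++) {
        for (int b = 0; b < length_b - longest; b++) {
            int offset = 0;
            for (;;) {
                checks++;  /* one evaluation of the innermost condition */
                if (A[a + offset] == '\0' || B[b + offset] == '\0' ||
                    A[a + offset] != B[b + offset])
                    break;
                offset++;  /* offset now equals the current match length */
                if (offset > longest)
                    longest = offset;
            }
        }
    }
    return checks;
}
```

For example, count_inner_checks("aaaa", "bbbbb") evaluates the condition once per pair of start positions, 4 * 5 = 20 times, while count_inner_checks("abcd", "abcd") stops after 5 evaluations (four matching characters plus the check that hits the terminator).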
If the strings are similar but not identical, then longest_length_found would become a significant fraction of the length of the shorter string, which would divide out one of the factors in N^3, leaving us with N^2, right? I'm just trying to understand what happens when the strings are remarkably similar but not identical.
Thinking out loud: what if, starting at the first letter, you find a substring about half the length of A? This would mean that you would run A/2 iterations of the first loop, B-(A/2) iterations of the second loop, and then up to A/2 iterations of the third loop (assuming the strings were very similar) without finding a longer substring. Assuming roughly equal-length strings, that's N/2 * N/2 * N/2 = O(N^3).
Sample strings which could show this behavior:
A A A B A A A B A A A B A A A B
A A A A B A A A A B A A A A B A
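To try this pattern at other lengths, the two sample strings can be generated rather than typed. A minimal sketch, where fill_pattern is a hypothetical helper name and the strings are written without the spaces shown above:

```c
#include <stddef.h>

/* Hypothetical helper: writes n characters of a periodic pattern made of
   `run` copies of 'A' followed by one 'B', repeated, then a terminator.
   buf must have room for n + 1 bytes. */
void fill_pattern(char *buf, size_t n, size_t run)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (i % (run + 1) == run) ? 'B' : 'A';
    buf[n] = '\0';
}
```

Calling fill_pattern(a, 16, 3) and fill_pattern(b, 16, 4) on 17-byte buffers reproduces the two 16-character samples; a larger n gives longer inputs of the same shape for timing the function.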
Am I close, or am I missing or misapplying something?
I'm pretty sure I could do better using a trie/prefix tree, but again, I'm just really interested in understanding the behavior of this specific code.