5

I have two very large strings and I am trying to find out their Longest Common Substring.

One way is using suffix trees (supposed to have a very good complexity, though a complex implementation), and the another is the dynamic programming method (both are mentioned on the Wikipedia page linked above).

Using dynamic programming alt text

The problem is that the dynamic programming method has a huge running time (complexity is O(n*m), where n and m are lengths of the two strings).

What I want to know (before jumping to implement suffix trees): Is it possible to speed up the algorithm if I only want to know the length of the common substring (and not the common substring itself)?

Community
  • 1
  • 1
Lazer
  • 90,700
  • 113
  • 281
  • 364

4 Answers4

3

These will make it run faster, though it'll still be O(nm).

One optimization is in space (which might save you a little allocation time) is noticing that LCSuff only depends on the previous row -- therefore if you only care about the length, you can optimize the O(nm) space down to O(min(n,m)).

The idea is to keep only two rows -- the current row that you are processing, and the previous row that you just processed, and throw away the rest.

Larry
  • 4,491
  • 2
  • 24
  • 16
  • @Larry: thanks! I had already implemented this one, though. Any others that occur to you? – Lazer Apr 25 '10 at 21:42
  • The other is to implement both top-down and bottom-up. You can apply some branch and bound techniques with top-down to speed things up, and possibly skip states that will never be needed. – Larry Apr 25 '10 at 21:54
2

Will it be faster in practice? Yes. Will it be faster regarding Big-Oh? No. The dynamic programming solution is always O(n*m).

The problem that you might run into with suffix trees is that you trade the suffix tree's linear time scan for a huge penalty in space. Suffix trees are generally much larger than the table you'd need to implement for a dynamic programming version of the algorithm. Depending on the length of your strings, it's entirely possible that dynamic programming will be faster.

Good luck :)

Billy ONeal
  • 104,103
  • 58
  • 317
  • 552
  • 2
    @Billy ONeal: are you comparing suffix tree and dynamic programming? I am not asking for that. `What I what to know is that is there a way to make the dynamic programming algorithm faster if I only want to know the length of the common substring?` – Lazer Apr 25 '10 at 21:28
  • @eSKay: I believe the first part of my answer answers that question. – Billy ONeal Apr 25 '10 at 21:36
  • @eSKay: The algorithm for finding the length and for finding the actual string are the same. The only difference is that the algorithm finding the actual string stores the actual string along with that length in the table. By not storing the string, you can go faster because you spend less time allocating strings. Therefore, it's faster, but not in terms of Big-Oh. – Billy ONeal Apr 25 '10 at 21:57
0

Here's a simple algorithm which can finishes in O((m+n)*log(m+n)), and much easier to implement compared to suffix tree algorithm which is O(m+n) runtime.

let it start with min common length (minL) = 0, and max common length (maxL) = min(m+n)+1.

1. if (minL == maxL - 1), the algorithm finished with common len = minL.

2. let L = (minL + maxL)/2

3. hash every substring of length L in S, with key = hash, val = startIndex.

4. hash every substring of length L in T, with key = hash, val = startIndex. check if any hash collision in to hashes. if yes. check whether whether they are really common substring. 

5. if there're really common substring of length L, set minL = L, otherwise set maxL = L. goto 1.

The remaining problem is how to hash all substring with length L in time O(n). You can use a polynomial formula as follow:

Hash(string s, offset i, length L) = s[i] * p^(L-1) + s[i+1] * p^(L-2) + ... + s[i+L-2] * p + s[i+L-1]; choose any constant prime number p.

then Hash(s, i+1, L) = Hash(s, i, L) * p - s[i] * p^L + s[i+L];
DU Jiaen
  • 955
  • 6
  • 14
-1

Myer's bit vector algorithm can help you. It works by using bit manipulation and is a much faster approach.

Nathan Tuggy
  • 2,237
  • 27
  • 30
  • 38
  • @Lance: "Use X named canonical algorithm" is **definitely** an answer, albeit a somewhat sparse one. – Nathan Tuggy Nov 16 '15 at 04:30
  • Um, I have no recollection of making that comment. Sorry. If anything, I'd have called it out for being a link-only answer. – Lance Nov 16 '15 at 04:57