I want to find the longest common subsequence (LCS) of N strings. I have the dynamic programming algorithm for 2 strings, but extending it to N strings would consume an exponential amount of memory, since it needs an N-dimensional array. That is not an option.

In the common case (90%), almost all strings will be the same.

Suppose I split my N sequences into N/2 pairs and run the 2-string LCS separately on each pair. That gives me N/2 subsequences. I can remove the duplicates and repeat the process until only one subsequence is left, and that subsequence is common to all strings in the input.

Is there something that I am missing? This doesn't look like a solution to an NP-hard problem...

I know that each call to LCS on a pair of strings may have more than one subsequence as its solution. If I pass only one of those subsequences on to the next call, my final subsequence may not be the longest possible, but I would still have something that may fit my needs.

If I instead take all possible solutions for one pair and combine them with all possible solutions from the other pairs (each of which may also have more than one), I may end up with exponential time. Am I right?
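For concreteness, here is a minimal Python sketch of the pairwise reduction described above. The names `lcs_pair` and `pairwise_reduce` are made up for illustration, and the sketch keeps only one LCS per pair, so it is a heuristic: the result is always a common subsequence of every input (a subsequence of a subsequence is still a subsequence), but it is not guaranteed to be the longest one.

```python
def lcs_pair(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (standard 2-string DP)."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start to rebuild one optimal subsequence.
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


def pairwise_reduce(strings):
    """Repeatedly replace adjacent pairs by one of their LCSs until one string is left."""
    # dict.fromkeys removes duplicates while keeping the original order.
    current = list(dict.fromkeys(strings))
    while len(current) > 1:
        nxt = [lcs_pair(current[k], current[k + 1])
               for k in range(0, len(current) - 1, 2)]
        if len(current) % 2:  # an unpaired string is carried to the next round
            nxt.append(current[-1])
        current = list(dict.fromkeys(nxt))
    return current[0] if current else ""


if __name__ == "__main__":
    print(pairwise_reduce(["aaabb1xyz", "aaabb2xyz", "cccdd1xyz", "cccdd2xyz"]))
```

Each round costs one quadratic DP per pair and the number of strings at least halves, so time and memory stay polynomial. What is given up is optimality, not feasibility.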

lmcarreiro

1 Answer

Yes, you're missing the correctness: there is no guarantee that the LCS of a pair of strings will have any overlap whatsoever with the LCS of the set overall. Consider this example:

aaabb1xyz
aaabb2xyz
cccdd1xyz
cccdd2xyz

If you pair these in the given order, you'll get LCSs of aaabb and cccdd, missing the xyz for the set.

If, as you say, the strings are almost all identical, perhaps the differences aren't a problem for you. If the non-identical strings are very similar to the "median" string, then your incremental solution will work well enough for your purposes.

Another possibility is to do LCS on random pairs of strings until that median string emerges; then you start from that common point, and you should have a "good enough" solution.

Prune
  • Thanks for your answer, but I think you are confusing this with the longest common substring problem. From Wikipedia: "unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences" – lmcarreiro Nov 01 '17 at 22:49
  • so, in your example, the LCS in the first pair would be `aaabbxyz`, in the second pair would be `cccddxyz` and the final LCS `xyz` – lmcarreiro Nov 01 '17 at 22:51
  • Oh ... a different semantic for "sequential". Understood. In that case, my counter-example isn't counter. – Prune Nov 01 '17 at 23:37
  • You're still right, though. (xaaabbbx, aaaxxbbb) -> aaabbb and (xcccdddx, cccxxddd) -> cccddd, but the LCS of all 4 is xx – Matt Timmermans Nov 02 '17 at 01:26
  • @MattTimmermans: Thanks. I could see there was another counter example, but didn't have the brain-bandwidth to develop it. – Prune Nov 02 '17 at 16:07
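
Matt Timmermans's counterexample can be checked mechanically. The sketch below is only an illustration: `lcs_pair` is a compact memoized recursion suitable for short strings, and `is_subsequence` is a hypothetical helper name. It shows that the two pairwise LCSs share no characters, so the pairwise reduction ends with the empty string even though `xx` is common to all four inputs.

```python
from functools import lru_cache


def lcs_pair(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (memoized recursion)."""
    @lru_cache(maxsize=None)
    def go(i: int, j: int) -> str:
        if i == len(a) or j == len(b):
            return ""
        if a[i] == b[j]:
            return a[i] + go(i + 1, j + 1)
        # Otherwise skip one character on either side and keep the longer result.
        return max(go(i + 1, j), go(i, j + 1), key=len)

    return go(0, 0)


def is_subsequence(s: str, t: str) -> bool:
    """True if s can be obtained from t by deleting characters."""
    it = iter(t)
    # `ch in it` consumes the iterator, so matches must appear in order.
    return all(ch in it for ch in s)


strings = ["xaaabbbx", "aaaxxbbb", "xcccdddx", "cccxxddd"]

first = lcs_pair(strings[0], strings[1])   # LCS of the first pair
second = lcs_pair(strings[2], strings[3])  # LCS of the second pair
merged = lcs_pair(first, second)           # what the pairwise reduction ends with

print(repr(first), repr(second), repr(merged))
print(all(is_subsequence("xx", s) for s in strings))
```

Both pairwise LCSs drop the `x` characters, which are the only characters the two pairs have in common, so no choice among tied solutions can recover `xx` here.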