
I'm trying to calculate the number of longest common subsequences that exist between two strings.

e.g. String X = "efgefg"; String Y = "efegf";

Output: The number of longest common subsequences is: 3 (i.e. efeg, efef, efgf; these don't need to be produced by the algorithm and are shown here only for demonstration)

I've managed to do this in O(|X|*|Y|) using dynamic programming based on the general idea here: Cheapest path algorithm.
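For reference, here is a minimal sketch of one O(|X|*|Y|) dynamic program that counts distinct longest common subsequences. It is not necessarily the exact approach referred to above; the function name `count_distinct_lcs` and the table names are made up for illustration. It keeps a table of LCS lengths next to a table of counts, and uses inclusion-exclusion when the last characters differ so that strings reachable from both neighbouring cells are not counted twice.

```python
def count_distinct_lcs(x: str, y: str) -> int:
    """Count the distinct longest common subsequences of x and y (sketch)."""
    m, n = len(x), len(y)
    # length[i][j]: LCS length of x[:i] and y[:j]
    # count[i][j]:  number of distinct LCS strings of x[:i] and y[:j]
    length = [[0] * (n + 1) for _ in range(m + 1)]
    # With an empty prefix the only common subsequence is the empty string.
    count = [[1] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                # Every LCS here is an LCS of the shorter prefixes plus this char.
                length[i][j] = length[i - 1][j - 1] + 1
                count[i][j] = count[i - 1][j - 1]
            else:
                best = max(length[i - 1][j], length[i][j - 1])
                total = 0
                if length[i - 1][j] == best:
                    total += count[i - 1][j]
                if length[i][j - 1] == best:
                    total += count[i][j - 1]
                if length[i - 1][j - 1] == best:
                    # These strings were counted in both cells above.
                    total -= count[i - 1][j - 1]
                length[i][j] = best
                count[i][j] = total
    return count[m][n]


print(count_distinct_lcs("efgefg", "efegf"))  # 3
```

Python integers are unbounded, so the count cannot overflow even when it grows exponentially with the input length.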

Can anyone think of a way to do this calculation with a better runtime?

--Edited in response to Jason's comment.

Meir
  • These look to be subsequences and not substrings. Please clarify. – jason Feb 11 '10 at 15:16
  • I am not sure I understand what you are calculating. What is the rule that makes efeg, efef, efgf all valid solutions? I suppose you can't rearrange order of chars, but only remove some? Are the two strings supposed to be completely generic, so that you may have "X=AAAAAAAAAAAAAAAAAAAAAAAAA" and "Y=B" for example, and in this case the answer would be 0? – p.marino Feb 11 '10 at 15:25
  • @p.marino: correct. You can't rearrange the order, but you can remove letters. The answer would be 0 in your example. – Meir Feb 11 '10 at 15:31
  • For X=AAAAAAAAAAAAAAAAA and Y=B, shouldn't the amount of longest common subsequences be 1? There is one common subsequence of length 0, which is the longest one. – rettvest Feb 11 '10 at 20:58
  • See http://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Complexity, http://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Computing_the_length_of_the_LCS – Beni Cherniavsky-Paskin Feb 26 '10 at 10:43

4 Answers


The longest common subsequence problem is a well-studied CS problem.

You may want to read up on it here: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem

KaptajnKold

I don't know, but here is some thinking aloud:

The worst case I was able to construct has an exponential number, 2**(0.5 |X|), of longest common subsequences:

X = "aAbBcCdD..."
Y = "AaBbCcDd..."

where the longest common subsequences include exactly one of {A, a}, exactly one of {B, b} and so forth... (nitpicking: if your alphabet is limited to 256 chars, this breaks down eventually, but 2**128 is already huge.)
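A small brute-force check of this construction, as a sketch only: the helper names `worst_case_pair` and `brute_force_lcs_count` are made up for illustration, and the enumeration is itself exponential, so it is only usable for a handful of letter pairs.

```python
from itertools import combinations
from string import ascii_lowercase


def worst_case_pair(k: int) -> tuple[str, str]:
    """Build X = "aAbB..." and Y = "AaBb..." from the first k letters."""
    letters = ascii_lowercase[:k]
    x = "".join(c + c.upper() for c in letters)
    y = "".join(c.upper() + c for c in letters)
    return x, y


def subsequences(s: str) -> set[str]:
    """All distinct subsequences of s (exponential in len(s))."""
    return {
        "".join(chars)
        for r in range(len(s) + 1)
        for chars in combinations(s, r)
    }


def brute_force_lcs_count(x: str, y: str) -> int:
    """Count distinct longest common subsequences by full enumeration."""
    common = subsequences(x) & subsequences(y)
    longest = max(len(s) for s in common)  # the empty string is always common
    return sum(1 for s in common if len(s) == longest)


for k in range(1, 5):
    x, y = worst_case_pair(k)
    print(k, len(x), brute_force_lcs_count(x, y))  # count doubles with each pair
```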

However, you don't necessarily have to generate all subsequences to count them. If you've got O(|X| * |Y|), you are already better than that! What we learn from this is that any algorithm better than yours must not attempt to generate the actual subsequences.

Beni Cherniavsky-Paskin

First of all, we do know that finding any longest common subsequence of two sequences with length n cannot be done in O(n^(2-ε)) time unless the Strong Exponential Time Hypothesis fails, see: https://arxiv.org/abs/1412.0348

This pretty much implies that you cannot count the number of ways to align common subsequences to the input sequences in O(n^(2-ε)) time. On the other hand, it is possible to count the number of such alignments in O(n^2) time. It is also possible to count them in O(n^2/log(n)) time with the so-called four-Russians speed-up.
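A minimal sketch of the quadratic alignment count, under the assumption that an "alignment" means a maximum-size set of matched position pairs (equivalently, an embedding of a longest common subsequence into both inputs). The name `count_lcs_alignments` is made up for illustration, and the four-Russians speed-up is not shown.

```python
def count_lcs_alignments(x: str, y: str) -> int:
    """Count maximum-size order-preserving matchings between x and y (sketch)."""
    m, n = len(x), len(y)
    # length[i][j]: LCS length of x[:i] and y[:j]
    # ways[i][j]:   number of maximum-size matchings between x[:i] and y[:j]
    length = [[0] * (n + 1) for _ in range(m + 1)]
    ways = [[1] * (n + 1) for _ in range(m + 1)]  # the empty matching

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = x[i - 1] == y[j - 1]
            best = max(length[i - 1][j], length[i][j - 1])
            if match:
                best = max(best, length[i - 1][j - 1] + 1)
            total = 0
            if match:
                # Matchings that pair position i with position j.
                total += ways[i - 1][j - 1]
            if length[i - 1][j] == best:
                total += ways[i - 1][j]      # matchings that leave x[i-1] unused
            if length[i][j - 1] == best:
                total += ways[i][j - 1]      # matchings that leave y[j-1] unused
            if length[i - 1][j - 1] == best:
                total -= ways[i - 1][j - 1]  # neither used: counted twice above
            length[i][j] = best
            ways[i][j] = total
    return ways[m][n]


print(count_lcs_alignments("aa", "a"))  # 2
```

For instance, `count_lcs_alignments("aa", "a")` gives 2, although "a" is the only distinct longest common subsequence: the two alignments differ only in which "a" of the first string is used.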

Now the real question is whether you really intended to calculate this, or whether you want to find the number of different subsequences. I am afraid that the latter is a #P-complete counting problem. At least, we do know that counting the number of sequences with a given length that a regular grammar can generate is #P-complete:

S. Kannan, Z. Sweedyk, and S. R. Mahaney. Counting and random generation of strings in regular languages. In ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 551–557, 1995

This problem is similar in the sense that counting the number of ways a regular grammar can generate sequences of a given length is a trivial dynamic programming exercise. However, if you do not want to distinguish between generations that result in the same sequence, the problem turns from easy to extremely hard. My natural conjecture is that the same holds for sequence alignment problems, too (longest common subsequence, edit distance, shortest common superstring, etc.).

So if you would like to calculate the number of different longest common subsequences of two sequences, then very likely your current algorithm is wrong, and no algorithm can calculate it in polynomial time unless P = NP (and more...).

melpomene

The best explanation (with code) I found:

Count all LCS

Jay Patel
  • Please expound upon your answer here, as opposed to simply including an external link. – kjones Jul 03 '17 at 17:14
  • Whilst this may theoretically answer the question, [it would be preferable](//meta.stackoverflow.com/q/8259) to include the essential parts of the answer here, and provide the link for reference. – GhostCat Jul 03 '17 at 18:48