I'm trying to figure out how to compute the shortest sequence containing a given set of subsequences. For example, given:

  abcd
  bcdgh
  cdef

The answer should be abcdefgh.
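To make the problem statement concrete, here is a small checker for whether a candidate really contains every input as a subsequence. The function names are illustrative and not part of the question.

```python
def is_subsequence(needle, haystack):
    """True if needle can be obtained from haystack by deleting items."""
    it = iter(haystack)
    # "ch in it" consumes the iterator up to the first match, so order is enforced.
    return all(ch in it for ch in needle)


def is_common_supersequence(candidate, seqs):
    """True if every sequence in seqs is a subsequence of candidate."""
    return all(is_subsequence(s, candidate) for s in seqs)


print(is_common_supersequence("abcdefgh", ["abcd", "bcdgh", "cdef"]))
```

A checker like this is useful for validating the output of any heuristic, since a greedy result can be long but should never be invalid.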

I was thinking about first computing the longest common subsequence of all the strings and then going string by string, adding what's missing. Given that I want to run this on an input of about 5-10 sequences, each 50-100 items long, a joint LCS over all of them would be on the order of O(100^10), which is far too time-consuming.

Would the following approach give the near-optimal answer for most inputs?

  • Compute LCS of string 1 and 2
  • Add missing items from string 1 and 2
  • Compute LCS of result with string 3
  • Add missing items from string 3, and so on

(I assume not, because it is ambiguous where to add the missing items after each step.)
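The pairwise idea in the list above can be sketched as follows. This is a minimal sketch, assuming the "add missing items" step is done by computing the shortest common supersequence of two strings with the usual LCS-style table, which removes the ambiguity for each pair but not across the whole set. The names `scs_pair` and `scs_fold` are made up for illustration.

```python
from functools import reduce


def scs_pair(a, b):
    """One shortest common supersequence of two strings, via dynamic programming."""
    n, m = len(a), len(b)
    # dp[i][j] = length of the shortest supersequence of a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n:
                dp[i][j] = m - j
            elif j == m:
                dp[i][j] = n - i
            elif a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = 1 + min(dp[i + 1][j], dp[i][j + 1])
    # Walk the table forward to rebuild one optimal answer.
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] <= dp[i][j + 1]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return "".join(out)


def scs_fold(seqs):
    """Heuristic: merge the inputs left to right, one pair at a time."""
    return reduce(scs_pair, seqs)


print(scs_fold(["abcd", "bcdgh", "cdef"]))
```

Each pairwise merge is optimal for its two arguments, but the fold as a whole is only a heuristic: the result depends on the order of the inputs and on how ties are broken. On the three example strings it does reach length 8, though not necessarily the same string as abcdefgh, because shortest supersequences are not unique.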

I'm looking for a fast computation (a few milliseconds) and am ready to accept occasional non-optimal solutions if an efficient deterministic algorithm is not possible.

I'm sure people have thought about this, would be glad if someone can point me in the right direction.

Thanks,

Martin

Looking up literature on LCS and related problems

Martin
  • I'm going to say that this is impossible to solve in a few milliseconds. At least in the general case. There may be some information that you haven't shared that simplifies the problem. For example, all of the sequences started from a common ancestor, and either you know the evolutionary tree, or the changes allowed between generations are tightly constrained. – user3386109 Jan 23 '23 at 22:46
  • Does each item appear at most once in each subsequence? – David Eisenstat Jan 24 '23 at 12:02
  • Letters can appear an arbitrary number of times, and unfortunately there is no evolution tree. However my "alphabet" will typically be rather short (5 to 10 different letters). Not all letters will appear in all subsequences. – Martin Jan 24 '23 at 20:12

1 Answer

I would try a greedy approach based on a topological sort.

First, for each subsequence, we record how many of each symbol it contains. For every pair of letters that are both the first copy of that letter in the subsequence, we record in a graph which letter comes first and which comes second. For every first copy of a letter that is preceded by a second or later copy of some letter, we record in the graph that it has a currently impossible dependency.

Now we try to do a topological sort, except that each time we use a first letter, we now add the next copy of the letter to our sort, with the appropriate transition rules. And we also keep track of where we actually are in each subsequence as we go.

In the example you first presented, the topological sort finds the answer without problems. But life gets interesting when we get stuck.

When we get stuck, we have to insert a letter.

Our preference is to recognize that, say, some word has two "e"s in it, so words with one "e" left can match their "e" to the second copy, and we can put the first one down now. We are in this situation if there is a letter that at least one word is stuck at, and all of the words that use this letter the most remaining times are ready to use it now. We then take every word that doesn't have this letter next, say that it will match this letter the next time it comes up, and use this letter. (If there are multiple choices for such a letter, use the one that is in the longest remaining subsequence first.)

Otherwise we are going to have to insert a letter in such a way that the combined sequence will have to be longer. So pick the letter for the longest remaining subsequence, insert it, and proceed.
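Here is a rough sketch of a simplified version of this greedy. It is not the full procedure described above: it keeps the topological step (emit a letter only when every sequence that still needs it has it next) and the last-resort rule (take the next letter of the longest remaining subsequence), but it skips the intermediate rule about matching a letter to a later copy and builds no explicit graph. The name `greedy_supersequence` is made up for illustration.

```python
def greedy_supersequence(seqs):
    """Greedy common supersequence: free letters first, else longest-remaining."""
    pos = [0] * len(seqs)
    out = []
    while True:
        # Unconsumed tails of the sequences that are not finished yet.
        live = [s[p:] for s, p in zip(seqs, pos) if p < len(s)]
        if not live:
            return "".join(out)
        fronts = sorted({r[0] for r in live})  # sorted for deterministic ties
        # A letter is "free" if every tail that still contains it starts with it,
        # so emitting it now forces no sequence to wait for another copy.
        free = [c for c in fronts if all(r[0] == c for r in live if c in r)]
        if free:
            letter = free[0]
        else:
            # Stuck: fall back to the next letter of the longest remaining tail.
            letter = max(live, key=len)[0]
        out.append(letter)
        # Advance every sequence whose next item is the emitted letter.
        pos = [p + 1 if p < len(s) and s[p] == letter else p
               for s, p in zip(seqs, pos)]


print(greedy_supersequence(["abcd", "bcdgh", "cdef"]))  # abcdefgh
print(greedy_supersequence(["ab", "ba"]))               # stuck at once, uses fallback
```

Each step is linear in the total remaining length, so for 5-10 sequences of 50-100 items the whole run should be cheap. The output is always a valid supersequence, but nothing guarantees it is the shortest one.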

For simple examples, like the one you gave, this will always find the optimal solution. But otherwise it is greedy, not optimal. You'll need to test to see whether it finds good enough solutions. But there is at least a decent chance that it will work well enough.
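For that kind of testing, an exact baseline on small inputs helps. The sketch below is not part of the greedy; it is a plain breadth-first search over tuples of positions, one position per subsequence, and the name `exact_scs_length` is made up for illustration. The state space is the product of the sequence lengths, so it is only usable on small test cases, not on inputs of the size in the question.

```python
from collections import deque


def exact_scs_length(seqs):
    """Length of a shortest common supersequence, by BFS over position tuples."""
    start = (0,) * len(seqs)
    goal = tuple(len(s) for s in seqs)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return dist[state]
        # Only letters that are next in some unfinished sequence are worth emitting.
        for letter in {s[p] for s, p in zip(seqs, state) if p < len(s)}:
            # Emitting a letter advances every sequence that is waiting for it.
            nxt = tuple(p + 1 if p < len(s) and s[p] == letter else p
                        for s, p in zip(seqs, state))
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    raise AssertionError("unreachable: the goal state can always be reached")


print(exact_scs_length(["abcd", "bcdgh", "cdef"]))  # 8
```

Comparing the length of a greedy result against this value on many small random inputs gives a rough idea of how often, and by how much, the greedy misses the optimum.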

btilly