2

I'm writing a Sublime Text script to align several lines of code. The script takes each line, splits it by a predefined set of delimiters (,;:=), and rejoins it with each segment in a 'column' padded to the same width. This works well when all lines have the same set of delimiters, but some lines may have extra segments, an optional comma at the end, and so forth.

My idea is to come up with a canonical list of delimiters. Specifically, given several strings of delimiters, I would like to find the shortest string that can be formed from any of the given strings using only insertions, with ties broken in some sensible manner. After some research, I learned that this is the well-known problem of global multiple sequence alignment, except that there are no mismatches, only matches and indels.

The dynamic programming approach, unfortunately, is exponential in the number of strings - at least in the general case. Is there any hope for a faster solution when mismatches are disallowed?

Thom Smith
  • 13,916
  • 6
  • 45
  • 91

1 Answers1

1

I'm a little hesitant to make a blanket statement that there is no such hope, even when mismatches are disallowed, but I'm pretty sure that there isn't. Here's why.

The size of the dynamic programming table generated when doing sequence alignment is approximately (string length)^(number of strings), hence the exponential run-time/space requirement. To give you a feel of where that comes from, here's an example with two strings, ABC and ACB, each of length 3. This gives us a 3x3 table:

    A B C
  A 0 1 2
  C 1 1 1
  B 2 1 2

We initialize this table starting from the upper left and working our way down to the lower right from there. The total cost to get to any location in the table is given by the number at that location (for simplicity, I'm assuming that insertions, deletions, and substitutions all have a cost of 1). The operation used to get to a given location is given by the direction that you moved from the previous value. Moving to the right means you are inserting elements from the top string. Moving down inserts elements from the sideways string. Moving diagonally means you are aligning elements from the top and bottom. If these elements don't match, then this represents a substitution and you increase the cost to get there.

And that's the problem. Saying mismatches aren't allowed doesn't rule out the operations that are responsible for the length and height of the table (insertions/deletions). Worse, disallowing mismatches doesn't even rule out a potential move. Diagonal movements in the table are still possible sometimes, just not when the two elements don't match. Plus, you still need to check to see if the elements match, so you're basically still considering that move. As a result, this shouldn't be able to improve your worst case time and seems unlikely to have a substantial effect on your average or best case time either.

On the bright side, this is a pretty important problem in bioinformatics, so people have come up with some solutions. They have their flaws, but may work well-enough for your case (particularly since it seems likely that you'll be less likely to have spurious alignments than you would with DNA, given that your strings are not-composed of a four-letter alphabet). So take a look at Star Alignment and Neighbor Joining.

seaotternerd
  • 6,298
  • 2
  • 47
  • 58