
OK this is what I want to do:

Take more than two strings and "align" them (not DNA/RNA sequences or the like, just regular strings, each well under 1000 characters long).

I've already done some work on pairwise alignment (aligning two strings); however, the "gaps" cause problems when I try to align more than one pair.

Example (one I'm currently testing):

ABCDEF
ABGHCEEF
AJKLBCDYEOF

AB--CDEF
ABGHCEEF
=======================
AB--C-EF

A-B--C--E-F
AJKLBCDYEOF
=======================
A----C--E-F

And another (more illustrative) example :

http://nest.drkameleon.com
http://www.google.com
http://www.yahoo.com

http://nest.drkameleon.com
http://-www.--google--.com

=======================
http://----.------le--.com

http://----.------le--.com
http://-www.-----yahoo.com

=======================
http://----.----------.com

What I'm currently doing:

  • Sort the strings (longer strings come first in the list)
  • Align the first pair, A and B, and keep the result (let's say R1)
  • Then align the second pair: R1 and C (result in R2)
  • Then align the third pair: R2 and D
  • And so on...
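
For reference, the fold described in the steps above can be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions: it uses a plain longest-common-subsequence (LCS) step as the pairwise "alignment", so mismatched characters are never kept in the result, and the names `lcs`, `progressive_common` and `mask` are made up for illustration. It also assumes the input strings do not themselves contain the gap character `-`.

```python
from functools import reduce


def lcs(a, b):
    """Return one longest common subsequence of a and b (O(len(a) * len(b)) DP)."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover the subsequence.
    out = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


def progressive_common(strings):
    """Sort longest-first, then fold: R1 = lcs(A, B), R2 = lcs(R1, C), ..."""
    ordered = sorted(strings, key=len, reverse=True)
    return reduce(lcs, ordered)


def mask(s, common, gap="-"):
    """Show s with every character outside the common subsequence replaced by gap."""
    out = []
    k = 0  # index of the next character of `common` still to be matched
    for ch in s:
        if k < len(common) and ch == common[k]:
            out.append(ch)
            k += 1
        else:
            out.append(gap)
    return "".join(out)


if __name__ == "__main__":
    words = ["ABCDEF", "ABGHCEEF", "AJKLBCDYEOF"]
    common = progressive_common(words)
    for w in words:
        print(mask(w, common))
```

Note that folding pairwise like this is a greedy heuristic: the result is always a common subsequence of all the inputs, but it is not guaranteed to be the longest one, and it can depend on the order in which the strings are processed.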

So, what do you think? How should I go about this? Is there a better way? (Of course, there must be...)

I'd rather do that in Perl/Python or something along these lines, however any type of code/reference would be more than welcome! :-)

Dr.Kameleon
  • Can you perhaps post some examples of what the inputs and outputs might be? I'm not 100% on what you actually want to do. – Li-aung Yip Apr 09 '12 at 13:03
  • also take a look at this article which explains in a detailed way the LCS problem in python. http://wordaligned.org/articles/longest-common-subsequence#toc21 – luke14free Apr 09 '12 at 13:05
  • @Li-aungYip Here's what I mean : http://stackoverflow.com/questions/10065293/how-to-align-2-strings – Dr.Kameleon Apr 09 '12 at 13:10
  • @luke14free This is correct; although, it deals only with pair-wise alignment. What I need is a way to align MORE than 2 string sequences... – Dr.Kameleon Apr 09 '12 at 13:11
  • I suggest you give us some sample groups to work with, so we have an idea of how the string differ and how they are similar. – Phil H Apr 09 '12 at 15:44
  • @PhilH I've updated my main post; please have a look. – Dr.Kameleon Apr 09 '12 at 15:56

2 Answers

I think you may be able to cast this as a more general string diff problem rather than a string alignment problem. Consider how GNU diff finds the differences between two files, and apply the same kind of algorithm to perform an N-way diff.

I'm not sure whether the time/memory complexity of this approach is acceptable for your needs, but you can at least think about the problem this way.
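
As a rough illustration of the diff idea, Python's standard `difflib.SequenceMatcher` can turn a two-way diff into a pair of gapped rows. This is only a sketch of the pairwise building block (the helper name `diff_align` is made up here, and `-` is assumed not to occur in the inputs); it does not perform an N-way diff by itself.

```python
from difflib import SequenceMatcher


def diff_align(a, b, gap="-"):
    """Build two equal-length rows from the diff opcodes of a and b."""
    row_a, row_b = [], []
    # autojunk=False switches off difflib's "popular element" heuristic.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        seg_a, seg_b = a[i1:i2], b[j1:j2]
        # Pad the shorter segment so both rows stay the same length;
        # for "equal" blocks the two segments already match in length.
        width = max(len(seg_a), len(seg_b))
        row_a.append(seg_a.ljust(width, gap))
        row_b.append(seg_b.ljust(width, gap))
    return "".join(row_a), "".join(row_b)


if __name__ == "__main__":
    top, bottom = diff_align("http://www.google.com", "http://www.yahoo.com")
    print(top)
    print(bottom)
```

The exact placement of the gaps depends on difflib's matching heuristic, so the rows may not be the ones a scoring-based aligner would choose.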

Li-aung Yip

There is an algorithm, based on the Levenshtein algorithm, that computes the longest common subsequence, with optional spaces. Not sure if that helps.

Alberto
  • 1
    Well, obviously I have played a lot with the Levenshtein algorithm, and then gave a try even to Hirschberg's, but what may come closer to my case is the **Needleman-Wunsch Algorithm** (http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm) – Dr.Kameleon Apr 09 '12 at 15:50
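
For anyone landing here, the Needleman-Wunsch algorithm mentioned in the comment above can be sketched in Python like this. It is a minimal pairwise version only; the scoring values (`match=1`, `mismatch=-1`, `gap=-1`) are arbitrary assumptions chosen for illustration, not values taken from this thread, and the function name is made up.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1, gap_char="-"):
    """Globally align a and b; return (row_a, row_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning the prefixes a[:i] and b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] against a gap
                score[i][j - 1] + gap,       # b[j-1] against a gap
            )
    # Trace back from the bottom-right corner to rebuild the two rows.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append(gap_char)
            i -= 1
        else:
            row_a.append(gap_char)
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), score[n][m]


if __name__ == "__main__":
    top, bottom, total = needleman_wunsch("ABCDEF", "ABGHCEEF")
    print(top)
    print(bottom)
    print(total)
```

Unlike a pure LCS step, this scoring allows two different characters to share a column when a mismatch is cheaper than opening gaps, which is how the `D`/`E` column in the first example of the question arises.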