25

I'm writing a program which has to compute a multiple sequence alignment of a set of strings. I was thinking of doing this in Python, but I could use an external piece of software or another language if that's more practical. The data is not particularly big, I do not have strong performance requirements, and I can tolerate approximations (i.e., I just need to find a good enough alignment). The only problem is that the strings are regular strings (i.e., UTF-8 strings, potentially with newlines that should be treated as regular characters); they aren't DNA sequences or protein sequences.

I can find tons of tools and information for the usual cases in bioinformatics, with specific complicated file formats and a host of features I don't need, but it is unexpectedly hard to find software, libraries or example code for the simple case of strings. I could probably reimplement any one of the many algorithms for this problem or encode my strings as DNA, but there must be a better way. Do you know of any solutions?

Thanks!

a3nm
  • What do you mean by compute? Are you trying to get a best alignment? – dting Apr 28 '11 at 06:56
  • Yes, or a reasonably good alignment (approximations are ok). – a3nm Apr 29 '11 at 01:17
  • Are you looking for a better diff tool too? – jan-glx Oct 29 '16 at 14:44
  • @Chris_Rands: Thanks! Indeed, it is packaged for Debian and seems to work on a simple example. It has some drawbacks though: spaces are removed (so they need to be handled separately), and UTF-8 is not supported (you need to convert to Latin-1 and it seems experimental). Thanks for pointing this out! Please don't hesitate to post it as an answer. – a3nm Apr 04 '17 at 14:07
  • Did you find a solution to this? if you did - please post your code :) – Jenny Jan 07 '22 at 09:23
  • @Jenny: I'm afraid I don't have anything new to share here. If I had, I would post it. :) – a3nm Jan 07 '22 at 11:17

4 Answers

16
The easiest way to align multiple sequences is to do a number of pairwise alignments.

First get pairwise similarity scores for each pair and store those scores. This is the most expensive part of the process. Choose the pair that has the best similarity score and do that alignment. Next, pick the not-yet-aligned sequence that has the best similarity score with any sequence already in the aligned set, and align it to the aligned set based on that pairwise alignment. Repeat until all sequences are in.

When you align a new sequence to the aligned set (based on a pairwise alignment) and this requires inserting a gap in the sequence that is already in the set, insert the gap at the same position in every sequence of the aligned set.

Lafrasu has suggested the SequenceMatcher() algorithm to use for pairwise alignment of UTF-8 strings. What I've described gives you a fairly painless, reasonably decent way to extend that to multiple sequences.

In case you are interested, it is equivalent to building up small sets of aligned sequences and aligning them on their best pair. It gives exactly the same result, but it is a simpler implementation.
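As an illustration only, the procedure described above could be sketched roughly as follows. This is a minimal sketch, not tested code from the answer: the names `pairwise`, `merge`, `progressive_align` and `render` are hypothetical, gaps are represented by `None` so that they cannot collide with any real character (including `-` or newlines), and `difflib.SequenceMatcher` opcodes stand in for a proper scoring-based pairwise aligner.

```python
from difflib import SequenceMatcher
from itertools import combinations

GAP = None  # gap marker; None cannot collide with any real character


def pairwise(a, b):
    """Align two strings; return a list of (char_or_GAP, char_or_GAP) columns."""
    cols = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        xs, ys = a[i1:i2], b[j1:j2]
        # Pair characters position by position and pad the shorter side
        # with gaps; this covers equal, replace, insert and delete opcodes.
        for k in range(max(len(xs), len(ys))):
            cols.append((xs[k] if k < len(xs) else GAP,
                         ys[k] if k < len(ys) else GAP))
    return cols


def merge(rows, anchor, cols, new):
    """Add sequence `new` to the aligned `rows` via its alignment to `anchor`."""
    out = {m: [] for m in rows}
    out[new] = []
    pos = 0  # current column in the existing alignment

    def copy_column(new_char):
        nonlocal pos
        for m in rows:
            out[m].append(rows[m][pos])
        out[new].append(new_char)
        pos += 1

    for anchor_char, new_char in cols:
        if anchor_char is GAP:
            # Gap in the anchor: open the same gap in every aligned row.
            for m in rows:
                out[m].append(GAP)
            out[new].append(new_char)
        else:
            # Skip columns where the anchor already has a gap.
            while rows[anchor][pos] is GAP:
                copy_column(GAP)
            copy_column(new_char)
    while pos < len(rows[anchor]):  # trailing gap columns of the anchor
        copy_column(GAP)
    return out


def progressive_align(strings):
    """Greedy progressive alignment; returns one list of chars/GAP per input."""
    if len(strings) < 2:
        return [list(s) for s in strings]
    score = {}
    for i, j in combinations(range(len(strings)), 2):
        ratio = SequenceMatcher(None, strings[i], strings[j], autojunk=False).ratio()
        score[i, j] = score[j, i] = ratio
    # Seed the aligned set with the most similar pair.
    i, j = max(combinations(range(len(strings)), 2), key=lambda p: score[p])
    cols = pairwise(strings[i], strings[j])
    rows = {i: [c[0] for c in cols], j: [c[1] for c in cols]}
    while len(rows) < len(strings):
        # Pick the unaligned sequence closest to any already-aligned one.
        candidates = ((k, n) for k in rows
                      for n in range(len(strings)) if n not in rows)
        anchor, new = max(candidates, key=lambda p: score[p])
        rows = merge(rows, anchor, pairwise(strings[anchor], strings[new]), new)
    return [rows[i] for i in range(len(strings))]


def render(row, gap_symbol="-"):
    """Turn an aligned row into a printable string."""
    return "".join(gap_symbol if c is GAP else c for c in row)


if __name__ == "__main__":
    for row in progressive_align(["dsa jld lal", "dsajld kll",
                                  "dsc jle kal", "dsd jlekal"]):
        print(render(row))
```

Every returned row has the same length, and dropping the gaps from a row gives back the original string. The quality of the result depends entirely on the pairwise step, so swapping in a different pairwise aligner only requires replacing `pairwise`.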

James Crook
  • I agree that it's feasible, but it's still some work. There *are* dedicated algorithms for multiple sequence alignment which seem to be all over the place in computational biology--there *has* to be some way to do the same things for strings. (Or if there isn't, I guess I'll have to write it, but this would be quite surprising...) Thanks anyway for your helpful answer, I'll combine it with lafrasu's current one for a quick and dirty solution if nothing better shows up. – a3nm May 04 '11 at 03:46
  • Do you have code that does this? – Jenny Jan 11 '22 at 08:17
5

Are you looking for something quick and dirty, as in the following?

from difflib import SequenceMatcher

a = "dsa jld lal"
b = "dsajld kll"
c = "dsc jle kal"
d = "dsd jlekal"

ss = [a, b, c, d]

s = SequenceMatcher()

# Compare every unordered pair of strings once.
for i in range(len(ss)):
    x = ss[i]
    s.set_seq1(x)
    for j in range(i + 1, len(ss)):
        y = ss[j]
        s.set_seq2(y)

        print()
        print(s.ratio())                # similarity score in [0, 1]
        print(s.get_matching_blocks())  # list of Match(a, b, size) triples
lafras
  • To be more precise: SequenceMatcher() does exactly what I want except that I have more than two sequences, and I don't see how I can deduce a global alignment from the pairwise alignments. I suppose I could cook up some dirty trick intersecting the common parts, but I would be quite unwilling to do something like that if there are regular clean algorithms for the multiple sequences case. Do you know anything like SequenceMatcher() but for more than two strings? – a3nm Apr 29 '11 at 01:34
  • @a3_nm: You're right, finding a _globally_ optimal alignment from the set of _local_ pairwise alignments is tricky. I'm still thinking about this. – lafras May 02 '11 at 19:18
  • your code is giving me wrong syntax on `print s.ratio()` – syrkull Jun 27 '18 at 18:11
2

MAFFT version 7.120+ supports multiple text alignment. Input is like FASTA format but with LATIN1 text instead of sequences and output is aligned FASTA format. Once installed, it is easy to run:

mafft --text input_text.fa > output_alignment.fa

Although MAFFT is a mature tool for biological sequence alignment, the text alignment mode is still in development, with future plans including support for user-defined scoring matrices. You can find further details in the documentation.
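If you want to drive this from Python, a minimal sketch might look like the following. It assumes only what is stated above (FASTA-like input in Latin-1, the `--text` flag, aligned FASTA on standard output); the helper names `to_fasta`, `parse_fasta` and `align_with_mafft` and the `seqN` record names are hypothetical, and the wrapper has not been checked against any particular MAFFT version.

```python
import os
import subprocess
import tempfile


def to_fasta(strings):
    """Build FASTA-like text with one record per string (names are made up)."""
    lines = []
    for n, s in enumerate(strings):
        if "\n" in s or "\r" in s:
            # A line break cannot be stored inside a one-line record;
            # such strings need to be encoded separately beforehand.
            raise ValueError("line breaks must be handled separately")
        lines.append(f">seq{n}")
        lines.append(s)
    return "\n".join(lines) + "\n"


def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    for line in text.split("\n"):
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])
        elif records:
            records[-1][1] += line  # sequences may be wrapped over lines
    return [(name, seq) for name, seq in records]


def align_with_mafft(strings, mafft="mafft"):
    """Run `mafft --text` on the strings; assumes mafft is on the PATH."""
    with tempfile.NamedTemporaryFile("w", encoding="latin-1", suffix=".fa",
                                     delete=False) as handle:
        handle.write(to_fasta(strings))
        path = handle.name
    try:
        result = subprocess.run([mafft, "--text", path],
                                capture_output=True, check=True)
    finally:
        os.unlink(path)
    return parse_fasta(result.stdout.decode("latin-1"))
```

Strings that cannot be encoded as Latin-1 raise `UnicodeEncodeError` when the file is written, so any UTF-8 input has to be mapped to Latin-1 first.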

Chris_Rands
1

I've pretty recently written a Python script that runs the Smith-Waterman algorithm (which is what is used to generate gapped local sequence alignments for DNA or protein sequences). It's almost certainly not the fastest implementation, as I haven't optimized it for speed at all (not my bottleneck at the moment), but it works and doesn't care about the identity of each character in the strings. I could post it here or email you the files if that's the kind of thing you're looking for.
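The script itself is not shown here. Purely as an illustration of the idea, a minimal Smith-Waterman sketch for arbitrary strings might look like this; the function name `smith_waterman`, the scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the linear gap penalty are all assumptions chosen for the example, and gaps are returned as `None` so that any character can appear in the input.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best local alignment.

    The aligned sequences are lists of characters, with None marking a gap.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(None)  # gap in b
            i -= 1
        else:
            out_a.append(None)  # gap in a
            out_b.append(b[j - 1])
            j -= 1
    return best, out_a[::-1], out_b[::-1]
```

This runs in time and memory proportional to `len(a) * len(b)`, which should be acceptable for small inputs. Note that it yields a local alignment of two strings only, so it would still have to be combined with some multiple-alignment strategy.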

DaveTheScientist