0

I have a set of 520 influenza sequences for which I have already done multiple sequence alignment, and computed the pairwise identity matrix. If I'd like to add in another sequence, I have to re-align everything, and recompute the entire PWI matrix. Is there any program I can use to "append" this other sequence to the alignment, and only compute the PWI w.r.t. every other sequence?

A simple example would be as follows. I have a 2x2 alignment, with the following scores.

     SeqA SeqB
SeqA 1.00 0.98
SeqB 0.98 1.00

Without re-running a full alignment, but only running "SeqC" against all the other sequences, I'd like to get the following matrix:

     SeqA SeqB SeqC
SeqA 1.00 0.98 0.99
SeqB 0.98 1.00 0.97
SeqC 0.99 0.97 1.00

I am using the BioPython package, and Python is my preferred language, but I'm okay with Java if need be too.

[I'll disclaim here that I'm cross-posting from BioStars, just in case there's experts here that aren't on BioStars. The BioStars post is: http://www.biostars.org/p/77607/, but it has the exact same content.]

ericmjl
  • 13,541
  • 12
  • 51
  • 80

1 Answers1

2

If your primary issue in the time it takes to re-run alignment (re-computing the PWI matrix should be computationally cheap), then MUSCLE has the capability to do what you're looking for, a procedure commonly termed "profile-profile alignment".

Profile-profile alignment

When passing the -profile flag, the alignments will be "re-aligned to each other, keeping input columns intact and inserting columns of gaps where needed.":

If you have two existing alignments of related sequences you can use the –profile option of MUSCLE to align those two sequences. Typical usage is:

   muscle -profile -in1 one.afa -in2 two.afa -out both.afa

Implementing in Biopython

Biopython has a wrapper around MUSCLE, but I find it easier to just use subprocess to call MUSCLE, then parse the results back into a MultipleSeqAlignment:

# Do profile-profile alignment (one sequence to many aligned)
seq_fn = "influenza_seq.fasta"
aligned_fn = "520_influenza_seqs.afasta"
cmd = ['muscle', '-clwstrict', '-profile', '-in1', seq_fn, '-in2', aligned_fn]
aligner = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = aligner.communicate()

# Get resulting alignment (MultipleSeqAlignment)
alignment =  AlignIO.read(StringIO(stdout), "clustal",
                          alphabet=Alphabet.ProteinAlphabet())
David Cain
  • 16,484
  • 14
  • 65
  • 75
  • Hi David, I have tried your subprocess code but it seems that can not work on my computer. Although I can use bash to execute the command like `muscle -in muscle_align_vap492zjqgxk915.fasta -out muscle_align_vap492zjqgxk915.out`. It can not work with `subprocess` in Python following your code I use `aligner = subprocess.Popen(muscle_command, stdout=subprocess.PIPE, stderr=subprocess.PIPE,shell = True) stdout, stderr = aligner.communicate() ` and the result is stdout is NULL and stderr is `-in: 1: -in: muscle: not found` – sikisis Sep 29 '16 at 00:41
  • @Davide Cain So what can I do to finish because I do not need to use it in BioPython. – sikisis Sep 29 '16 at 00:41