
I want to do a multiple sequence alignment in Biopython, but with a self-defined alphabet. The background: my sequences are sequences of numeric states, and there are up to 5000 distinct states. Thus I need an alphabet with 5000 letters, e.g. '0001', '0042', '4999'. Each sequence is up to 50 states/letters long.

So my main questions are:

  • How can I define such an Alphabet?
  • How can I use this Alphabet with the MultipleSequenceAlignment?

Alternatively: Is it possible to perform a MultipleSequenceAlignment on Lists/Arrays instead of Sequences?

Thanks for your time and help!

MattDMo
  • try posting this to biostars.org – pcantalupo Sep 06 '15 at 22:27
  • be very careful, many aligners take evolutionary history into account. E.g. transition/transversion ratio etc. Doing what you suggest will violate the assumptions behind these methods! – Stylize Sep 08 '15 at 20:43

1 Answer


You can define an alphabet by subclassing Bio.Alphabet.Alphabet. Maybe your case is similar to ThreeLetterProtein:

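# Bio.Alphabet is only available in Biopython releases before 1.78.
from Bio.Alphabet import Alphabet

class ThreeLetterProtein(Alphabet):
    """Three letter protein alphabet."""
    size = 3  # number of characters per letter
    letters = [
        "Ala", "Asx", "Cys", "Asp", "Glu", "Phe", "Gly", "His", "Ile",
        "Lys", "Leu", "Met", "Asn", "Pro", "Gln", "Arg", "Ser", "Thr",
        "Sec", "Val", "Trp", "Xaa", "Tyr", "Glx",
        ]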

The second part of the question is confusing. If you want Biopython itself to do the alignment, AFAIK the only option is Bio.pairwise2, which aligns two sequences at a time. For multiple alignments, Biopython only provides wrappers around popular external tools such as Muscle, Clustal, and T-Coffee. It can then read their output as a MultipleSeqAlignment with Bio.AlignIO.
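As for the alternative of aligning lists instead of sequences: below is a minimal pure-Python sketch of a global (Needleman-Wunsch) pairwise alignment that works directly on lists of state tokens. It is not part of Biopython. The function name `global_align`, the scoring values, and the `"----"` gap token are all assumptions chosen for illustration, and it only aligns two sequences at a time, so it is not a replacement for a real multiple aligner.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1, gap_token="----"):
    """Needleman-Wunsch global alignment of two lists of tokens.

    Hypothetical helper: the scores are arbitrary illustration values,
    not a biologically or statistically motivated model.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two tokens
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(gap_token)
            i -= 1
        else:
            out_a.append(gap_token)
            out_b.append(b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1], score[n][m]


s1 = ["0001", "0042", "4999", "0007"]
s2 = ["0001", "4999", "0007"]
aligned_a, aligned_b, best_score = global_align(s1, s2)
print(aligned_a)
print(aligned_b)
```

Because tokens are compared as whole list items, a gap can never fall inside a state such as '0042'. With sequences of at most 50 states the quadratic table stays tiny.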

xbello
  • I managed to define the Alphabet, but `Bio.pairwise2` ignores the size parameter. This is also the case for `ThreeLetterProtein()`:
    ```
    tla = Alphabet.ThreeLetterProtein()
    tls1 = SeqRecord(Seq("AlaAlaPheAlaAla", tla))
    tls2 = SeqRecord(Seq("AlaAlaProAlaAla", tla))
    align = pairwise2.align.globalxx(tls1, tls2)
    ```
    This results in `AlaAlaP-heAlaAla` and `AlaAlaPro-AlaAla`. – Marc Osterland Oct 13 '15 at 14:17
  • `Bio.pairwise2` won't align anything other than single-letter sequences. What I said is that Biopython doesn't "align" except with that particular module. Maybe you should try some "word alignment" algorithm outside Biopython. – xbello Oct 13 '15 at 15:56
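
One possible workaround for the single-letter limitation discussed in these comments is to map every numeric state to one unique Unicode character, align the resulting strings with a character-level aligner, and map the result back. The sketch below shows only the encoding and decoding steps. The helper names `encode` and `decode`, the `BASE` offset (the start of the CJK Unified Ideographs block, picked only because it offers far more than 5000 consecutive code points), and the `-` gap character are assumptions for illustration. Whether a given aligner accepts such characters has to be checked for that tool.

```python
# Hypothetical offset: U+4E00 starts the CJK Unified Ideographs block,
# which has more than 20,000 consecutive code points, enough for 5000 states.
BASE = 0x4E00


def encode(states):
    """Turn a list of numeric state labels like '0042' into a string
    with exactly one character per state."""
    return "".join(chr(BASE + int(s)) for s in states)


def decode(text, gap="-", width=4):
    """Turn an (aligned) one-character-per-state string back into
    zero-padded state labels; gap characters become a run of gaps."""
    return [
        gap * width if ch == gap else f"{ord(ch) - BASE:0{width}d}"
        for ch in text
    ]


states = ["0001", "0042", "4999"]
encoded = encode(states)
round_trip = decode(encoded)

# Simulate an aligner having replaced the middle state with a gap.
with_gap = decode(encoded[0] + "-" + encoded[2])
print(round_trip)
print(with_gap)
```

With this encoding, a gap inserted by a character-level aligner always falls between two states, so the `AlaAlaP-heAlaAla` kind of split cannot occur.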