How to create multiple sequence alignments from FASTA files rather than strings of protein sequences in Biopython

Question

I want to build a multiple sequence alignment from files I have downloaded into the same directory as my script. However, the Biopython Cookbook only shows how to do this by writing the sequences out as strings rather than loading them from files. I would like to do the latter. Here is how the multiple sequence alignment is made in Chapter 6.2 of the Biopython Cookbook:

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import MultipleSeqAlignment

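# The cookbook version also passed generic_dna (from Bio.Alphabet) to Seq.
# Bio.Alphabet was removed in Biopython 1.78, so it is omitted here.
align1 = MultipleSeqAlignment([
    SeqRecord(Seq("ACTGCTAGCTAG"), id="Alpha"),
    SeqRecord(Seq("ACT-CTAGCTAG"), id="Beta"),
    SeqRecord(Seq("ACTGCTAGDTAG"), id="Gamma"),
])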

The goal is to use this alignment to build a phylogenetic tree from all the protein sequences.

score 0 · Answer 1 · answered Nov 30 '19 at 18:30

The example uses three SeqRecord objects created from the DNA strings provided. SeqIO.parse lets you read files in, e.g., FASTA format and returns SeqRecord objects that you can pass to the alignment.

Example:

import os

from Bio import SeqIO
from Bio.Align import MultipleSeqAlignment

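# files is a list of the FASTA filenames in the current directory;
# adjust the test below to match how your files are named
files = sorted(f for f in os.listdir() if 'fasta' in f)

# collect every record from every file
records = []
for f in files:
    for record in SeqIO.parse(f, format='fasta'):
        records.append(record)

# all records must have the same length (i.e. already be aligned),
# otherwise MultipleSeqAlignment raises a ValueError
align1 = MultipleSeqAlignment(records)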
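
One thing to watch for: MultipleSeqAlignment does not align anything itself. It expects every sequence to have the same length and raises a ValueError otherwise, which is what happens with raw, unaligned FASTA downloads. A small check like the one below can tell you up front which records differ. This is only a sketch; the helper names are made up for illustration, and it works on any objects with .id and .seq attributes, such as the SeqRecord objects in the `records` list built above.

```python
from collections import defaultdict


def group_ids_by_length(records):
    """Map each sequence length to the ids of the records with that length."""
    groups = defaultdict(list)
    for record in records:
        groups[len(record.seq)].append(record.id)
    return dict(groups)


def same_length(records):
    """Return True when all records share one sequence length."""
    return len(group_ids_by_length(records)) <= 1


# Intended use with the records list (requires Biopython):
# if same_length(records):
#     align1 = MultipleSeqAlignment(records)
# else:
#     print(group_ids_by_length(records))  # shows which ids differ in length
```

If the lengths differ, the sequences still need to be aligned by an external tool before they can go into a MultipleSeqAlignment.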

Alternatively, if you already have the sequence files, you can concatenate them into one file and then run a tool such as Clustal Omega, either standalone or online, to compute the alignment.
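
The concatenate-then-align route could look roughly like this. It is a sketch, not a tested pipeline: it assumes the standalone `clustalo` executable is installed and on your PATH, and the helper names and file names are hypothetical.

```python
import subprocess
from pathlib import Path


def concatenate_fasta(paths, combined_path):
    """Write the contents of several FASTA files into one file."""
    with open(combined_path, "w") as out:
        for path in paths:
            text = Path(path).read_text()
            # A missing final newline would glue the next header onto
            # the last sequence line, so add one when needed.
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)
    return combined_path


def align_with_clustalo(combined_path, aligned_path):
    """Run Clustal Omega and load the result as a MultipleSeqAlignment."""
    # Imported here so that concatenate_fasta works without Biopython.
    from Bio import AlignIO

    # Assumption: the `clustalo` binary is installed and on PATH.
    subprocess.run(
        ["clustalo", "-i", str(combined_path), "-o", str(aligned_path),
         "--outfmt=fasta", "--force"],
        check=True,
    )
    return AlignIO.read(aligned_path, "fasta")


# Example (hypothetical file names):
# combined = concatenate_fasta(["a.fasta", "b.fasta"], "combined.fasta")
# align1 = align_with_clustalo(combined, "aligned.fasta")
```

AlignIO.read returns a MultipleSeqAlignment, so the result can be used in the same way as align1 in the cookbook example.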
