I have multi-FASTA file containing more than 10 000 fasta sequences resulted from Next Generation Sequencing and I want to do pairwise alignment of each sequence to each sequence inside the file and store all the results in the same new file in order to perform clustering analysis after. Example of FASTA sequence and my code for performing pairwise sequence alignment with python is written below.
FASTA sequence
>m180921_230442_42149_c101464342550000001823297908121882_s1_X0/538/ccs
AGAAGCCACTCATCCATCCAGGCAGGAAGACTCTTAGGATCCTGACTTTCTCCTGGTCCCCACATCCCCT
AAACCGAGGAAGGGGTCCAGCAGGGTCCGAGTCCCTGAAGCAAGGATTCTCCGTGGTCGTGTCCCCACAG
Please disregard the first line as it contains description summary of the sequence.
My code
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
X = "ACGGGT"
Y = "ACG"
#match score = 2, mismatch score = -1, gap opening = -5, gap extension = -2
alignments = pairwise2.align.globalms(X, Y, 2, -1, -5, -2)
for a in alignments:
print(format_alignment(*a))
The problem
I am wondering how can I modify it to loop over whole multi-FASTA file rather than just a code sequence. Also: How do I efficiently store the results as needed.