Pairwise alignment of multi-FASTA file sequences

Question

I have multi-FASTA file containing more than 10 000 fasta sequences resulted from Next Generation Sequencing and I want to do pairwise alignment of each sequence to each sequence inside the file and store all the results in the same new file in order to perform clustering analysis after. Example of FASTA sequence and my code for performing pairwise sequence alignment with python is written below.

FASTA sequence

>m180921_230442_42149_c101464342550000001823297908121882_s1_X0/538/ccs
AGAAGCCACTCATCCATCCAGGCAGGAAGACTCTTAGGATCCTGACTTTCTCCTGGTCCCCACATCCCCT
AAACCGAGGAAGGGGTCCAGCAGGGTCCGAGTCCCTGAAGCAAGGATTCTCCGTGGTCGTGTCCCCACAG

Please disregard the first line as it contains description summary of the sequence.

My code

    from Bio import pairwise2
    from Bio.pairwise2 import format_alignment

    X = "ACGGGT"
    Y = "ACG"

    #match score = 2, mismatch score = -1, gap opening = -5, gap extension = -2
    alignments = pairwise2.align.globalms(X, Y, 2, -1, -5, -2)

    for a in alignments:
        print(format_alignment(*a))

The problem

I am wondering how can I modify it to loop over whole multi-FASTA file rather than just a code sequence. Also: How do I efficiently store the results as needed.

Hi Aurora, welcome to SO! May you please edit your code and explain for the wider audience briefly what [FASTA](https://en.wikipedia.org/wiki/FASTA). Also, I am not quite sure how your problem and your code relate: I do not see any procedure for storing yet in your code. — B--rian, Aug 05 '19 at 16:19
@B--rian Thanks for suggestions, I edited post, but for the code I have no idea how to proceede next. — Aurora, Aug 06 '19 at 07:34
Also: Does https://stackoverflow.com/questions/42334644/ answer your question? — B--rian, Aug 06 '19 at 07:58
You can read your FASTA file with `SeqIO.parse()` and generate the combinations via one of the itertools modules. However, I'm not sure the Biopython pairwise aligner is best suited to clustering- why not use `Bio.Clustuer`? Or an external clustering tool like mmseqs or cd-hit? — Chris_Rands, Aug 07 '19 at 11:51
@Chris_Rands I wanted to use Biopython just for pairwise alignment and then proceede to clustering with some other moduels so that in the end I can generate consensus sequence. Thanks for suggestions — Aurora, Aug 07 '19 at 12:30
If this is just a learning exercise then go ahead, but tools like mmseqs and cd-hit do generate a consensus/representative sequence — Chris_Rands, Aug 07 '19 at 15:21
@Chris_Rands I will try with cd-hit then. But I have thousands of reads from 12 different sites stored in one FASTA file, is that a problem? — Aurora, Aug 08 '19 at 10:25

xbello · Answer 1 · 2019-08-20T07:24:15.000

First step: generate the pairs to be aligned

You probably want to align sequences only one time against each other, e.g. if you have sequences 1, 2 and 3, you only want to align 1vs2, 1vs3 and 2vs3 (i.e. all combinations), discarding 2vs1 and 3vs2 and self-alignments. This will save you some running time:

from itertools import combinations
from Bio import SeqIO

combinations(SeqIO.parse(infile, "fasta"), 2)

Step 2: align the pairs generated

The function pairwise2.align.globalms returns a tuple of (seqA, seqB, score, begin, end). We need to create SeqRecord objects from this tuples to be able to save it to files, adding the Scores as a description and preserving name and id:

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def align(pair):
    """Yield two Seqs aligned."""
    al = pairwise2.align.globalms(pair[0].seq, pair[1].seq, 2, -1, -5, -2)[0]

    yield SeqRecord(Seq(al[0]), id=pair[0].id, name=pair[0].name,
                    description=f"Score={al[2]}")
    yield SeqRecord(Seq(al[1]), id=pair[1].id, name=pair[1].name,
                    description=f"Score={al[2]}")

Step 3: sewing it all together

You can see the above functions are generators. Biopython writer deals with generated sequences cleanly, so you can just ask for pairs generated in the first function, send it to the align and write the yielded SeqRecords to an open handle:

input = "your_fasta.fas"

with open("output.fas", "w") as output:
    for pair in combinations(SeqIO.parse(infile, "fasta"), 2):
        SeqIO.write(align(pair), output, "fasta")

Pairwise alignment of multi-FASTA file sequences

FASTA sequence

My code

The problem

1 Answers1