Faster way of calculating the percentage of identical sites in alignment using biopython

Question

I developed the following code to calculate the number of identical sites in an alignment. Unfortunately the code is slow, and I have to iterate it over hundreds of files, it takes close to 12 hours to process more than 1000 alignments, meaning that something ten times faster would be appropriate. Any help would be appreciated:

import os
from Bio import SeqIO
from Bio.Seq import Seq
from Bio import AlignIO
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_dna
from Bio.Align import MultipleSeqAlignment
import time

a = SeqRecord(Seq("CCAAGCTGAATCAGCTGGCGGAGTCACTGAAACTGGAGCACCAGTTCCTAAGAGTTCCTTTCGAGCACTACAAGAAGACGATTCGCGCGAACCACCGCAT", generic_dna), id="Alpha")
b = SeqRecord(Seq("CGAAGCTGACTCAGTGGGCGGAGTCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACAATTCGTGCGAACCACCGCAT", generic_dna), id="Beta")
c = SeqRecord(Seq("CGAAGCTGACTCAGTTGGCAGAATCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACGATTCGTGCGAACCACCGCAT", generic_dna), id="Gamma")
d = SeqRecord(Seq("CGAAGCTGACTCAGTTGGCAGAGTCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACGATTCGTGCGAACCACCGCAT", generic_dna), id="Delta")
e = SeqRecord(Seq("CGAAGCTGACTCAGTTGGCGGAGTCACTGAAACTGGAGCACCAGTTCCTCAGAGTCCCCTTCGAGCACTACAAGAAGACGATTCGTGCGAACCACCGCAT", generic_dna), id="Epsilon")

align = MultipleSeqAlignment([a, b, c], annotations={"tool": "demo"})

start_time = time.time()
if len(align) != 1:
    for n in range(0,len(align[0])):
        n=0
        i=0
        while n<len(align[0]):    #part that needs to be faster
            column = align[:,n]
            if (column == len(column) * column[0]) == True:
                i=i+1
            n=n+1

    match = float(i)
    length = float(n)
    global_identity = 100*(float(match/length))
    print(global_identity)

print("--- %s seconds ---" % (time.time() - start_time))

Did you measure, which part of the code takes so long to process? If it's the `MultipleSeqAlignment` calculation, little can be offered. If it's your loop below, it's a different deal. How long is the `align` list? — 9000, Mar 12 '18 at 17:49
Files come already aligned, the alignment in the code above is just for reproducibility purpose. That being said the part that takes longest is: 'while n — omv, Mar 12 '18 at 18:11
I have a folder with ~1600 alignments, and this code takes ~12 hours to calculate the percentage of identical sites. — omv, Mar 12 '18 at 18:13
The shape of `aling` is five strings of a 100 characters, while the shape of `aling[0]` is basically a column composed of one character for each of the strings (5 characters, one each for every string/sequence). Hope that I am understanding your question, thanks for helping :). — omv, Mar 12 '18 at 19:34
So, you're trying to check that each of the 5 strings have the same characters in the column? If the characters in the column all match, you increment `i`, else you increment `n`. Is this correct? What do you do when the strings are of different length? I suppose it should register as a mismatch? — 9000, Mar 12 '18 at 19:42
Your interpretation of the code is correct. If the strings have different lengths `MultipleSeqAlignment` will add a gap using the character `-` and that column will be considered a mismatch. I believe the code is slow because when the `align` is long ~2000 characters, it gets slow checking every column — omv, Mar 12 '18 at 19:48

score 5 · Accepted Answer · edited Jun 20 '20 at 09:12

So, you're trying to check that each of the 5 strings have the same characters in the column? If the characters in the column all match, you increment i, else you increment n.

Your interpretation of the code is correct.

Based on the above, I'd hazard to suggest the following code as a faster alternative.

I suppose that align is a structure like this:

align = [
  'AGCTCGCGGAGGCGCTGCT....',
  'ACCTCGGAGGGCTGCTGTAC...',
  'AGCTCGGAGGGCTGCTGTAC...',
  # possibly more ...
]

We try to detect the columns of same characters in it. Above, the first column is AAA (a match), the next is GCG (a mismatch).

def all_equal(items):
    """Returns True iff all items are equal."""
    first = items[0]
    return all(x == first for x in items)


def compute_match(aligned_sequences):
    """Returns the ratio of same-character columns in ``aligned_sequences``.

    :param aligned_sequences: a list of strings or equal length.
    """
    match_count = 0
    mismatch_count = 0
    for chars in zip(*aligned_sequences):
        # Here chars is a column of chars, 
        # one taken from each element of aligned_sequences.
        if all_equal(chars):
            match_count += 1
        else:
            mismatch_count += 1
    return float(match_count) / float(mismatch_count)
    # What would make more sense:
    # return float(matches) / len(aligned_sequences[0])

An even shorter version:

def compute_match(aligned_sequences):
    match_count = sum(1 for chars in zip(*aligned_sequences) if all_equal(chars))
    total = len(aligned_sequences[0])
    mismatch_count = total - match_count  # Obviously.
    return ...

Thank you! in the example the time decreased from `0.101566076279 seconds` to `0.000661849975586 seconds` with your code. And the overall time of the scrip to go over the 1600 files decreased from `12 hours` to `18.1031439304 seconds` once I implemented your code — omv, Mar 12 '18 at 21:14
Great to hear. I hope the result is still correct! :) I mean, check a few results manually / visually, just in case. I don't see the real data, can't test well. — 9000, Mar 12 '18 at 21:19
I checked it against the old run and it is correct ヽ(´▽`)/ — omv, Mar 12 '18 at 21:23

Faster way of calculating the percentage of identical sites in alignment using biopython

1 Answers1