Estimate Alphabet in Biopython from fasta file

Question

I am looking for a way to read a .fasta file in Biopython and have the package estimate if we are dealing with DNA, RNA or proteins. So far, I read data like this:

with open('file.fasta', 'r') as f:
    for seq in sio.parse(f, 'fasta'):
        # do stuff, depending on alphabet

My problem is now that I do not know what kind of sequences I will find in the .fasta file. It can be either proteins, DNA or RNA but I have to know the number of letters in the alphabet.

Is there a way to "estimate" the alphabet from the sequences with Biopython? I am aware that one could have a protein containing just the letters ACGT which is why I would like to estimate the alphabet.

score 6 · Accepted Answer · answered Jan 11 '17 at 17:00

This is very difficult to do for small sequences. For example, the sequence ACGCGACAGA can be both a DNA, an RNA or a protein sequence, since the letters A, C and G are common to all three alphabets. Without other knowledge it is impossible to estimate which is the best match.

The following code will print out all the alphabets that the first record in a given FASTA file belongs to:

from Bio import SeqIO
from Bio.Alphabet.IUPAC import *

alphabets = [extended_protein,  ambiguous_dna, unambiguous_dna, extended_dna, ambiguous_rna, unambiguous_rna]

def validate(seq, alphabet):
    "Checks that a sequence only contains values from an alphabet"
    # inspired by https://www.biostars.org/p/102/
    leftover = set(str(seq).upper()) - set(alphabet.letters)
    return not leftover

with open("example.fasta") as handle:
    first_record = list(SeqIO.parse(handle, "fasta"))[0]
    valid_alphabets = [str(alphabet) for alphabet in alphabets if validate(first_record.seq, alphabet)]
    print("Valid alpahabet(s) for fasta file: {}".format(', '.join(valid_alphabets)))

So, for the sequence ACGCGACAGA this will print:

Valid alpahabet(s) for fasta file: ExtendedIUPACProtein(), IUPACAmbiguousDNA(), IUPACUnambiguousDNA(), ExtendedIUPACDNA(), IUPACAmbiguousRNA(), IUPACUnambiguousRNA()

But for the sequence MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVFX it will print:

Valid alpahabet(s) for fasta file: ExtendedIUPACProtein()

This already solves my problem. Thank you. I know that I couldn't expect more but in most cases when I have many sequences, your code pretty much nails it down. — romeasy, Jan 12 '17 at 12:55

Estimate Alphabet in Biopython from fasta file

1 Answers1