This is very difficult to do for small sequences. For example, the sequence ACGCGACAGA
can be both a DNA, an RNA or a protein sequence, since the letters A
, C
and G
are common to all three alphabets. Without other knowledge it is impossible to estimate which is the best match.
The following code will print out all the alphabets that the first record in a given FASTA file belongs to:
from Bio import SeqIO
from Bio.Alphabet.IUPAC import *
alphabets = [extended_protein, ambiguous_dna, unambiguous_dna, extended_dna, ambiguous_rna, unambiguous_rna]
def validate(seq, alphabet):
"Checks that a sequence only contains values from an alphabet"
# inspired by https://www.biostars.org/p/102/
leftover = set(str(seq).upper()) - set(alphabet.letters)
return not leftover
with open("example.fasta") as handle:
first_record = list(SeqIO.parse(handle, "fasta"))[0]
valid_alphabets = [str(alphabet) for alphabet in alphabets if validate(first_record.seq, alphabet)]
print("Valid alpahabet(s) for fasta file: {}".format(', '.join(valid_alphabets)))
So, for the sequence ACGCGACAGA
this will print:
Valid alpahabet(s) for fasta file: ExtendedIUPACProtein(), IUPACAmbiguousDNA(), IUPACUnambiguousDNA(), ExtendedIUPACDNA(), IUPACAmbiguousRNA(), IUPACUnambiguousRNA()
But for the sequence MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVFX
it will print:
Valid alpahabet(s) for fasta file: ExtendedIUPACProtein()