I have a FASTA file with a bunch of sequences with the following format:
BMRat|XM_008846946.1 ATGAAGAACATCACAGAAGCCACCACCTTCATTCTCAAGGGACTCACAGACAATGTGGAACTACAGGTCA TCCTCTTTTTTCTCTTTCTAGCGATTTATCTCTTCACTCTCATAGGAAATTTAGGACTTATTATTTTAGT TATTGGGGATTCAAAACTCCACAACCCTATGTACTGTTTTCTGAGTGTATTGTCTTCTGTAGATGCCTGC TATTCCTCAGACATCACCCCGAATATGTTAGTAGGCTTCCTGTCAAAAAACAAAGGCATTTCTCTCCATG GATGTGCAACACAGTTGTTTCTCGCTGTTACTTTTGGAACCACAGAATGCTTTCTGTTGGCGGCAATGGC TTATGACCGCTATGTAGCCATCCATGACCCACTTCTCTATGCAGTGAGCATGTCACCAAGGATCTATGTG CCGCTCATCATTGCTTCCTATGCTGGTGGAATTCTGCATGCGATTATCCACACCGTGGCCACCTTCAGCC TGTCCTTCTGTGGATCTAATGAAATCAGTCATATATTCTGTGACATCCCTCCTCTGCTGGCTATTTCTTG TTCTGACACTTACATCAATGAGCTCCTGTTGTTCTTCTTTGTGAGCTCCATAGAAATAGTCACTATCCTC ATCATCCTGGTCTCTTATGGTTTCATCCTTATGGCCATTCTGAAGATGAATTCAGCTGAAGGGAGGAGAA AAGTCTTCTCTGCATGTGGGTCTCACCTAACTGGAGTGTCCATTTTCTATGGGACAAGCCTTTTCATGTA TGTGAGACCAAGCTCCAACTATTCCTTGGCACATGACATGGTAGTGTCGACATTTTATACCATTGTGATT CCCATGCTGAACCCTGTCATCTACAGTCTGAGGAACAAAGATGTGAAAGAGGCAATGAGAAGATTTTTGA AGAAAAATTTTCAGAAACTTTAA
The code implemented using biopython http://biopython.org/wiki/Seq allows me to find the longest sequence of amino acids that starts with Methionine and ends with a Stop codon, of each sequence in the FASTA file.
The function is find_largest_polypeptide_in_DNA
. Basically it translates the DNA sequence to an amino acid sequence using the 3 different forward reading frames, and in the variable allPossibilities
it saves the segments that starts with M (a particular amino acid) and end in a stop codon. Then it compares the lengths of the possibilities and selects the longest possibility, returning the protein sequence of that segment.
def find_largest_polypeptide_in_DNA(seq, translationTable=1):
allPossibilities = []
for frame in range(3):
trans = str(seq[frame:].translate(translationTable))
framePossibilitiesF = [i[i.find("M"):] for i in trans.split("*") if "M" in i]
allPossibilities += framePossibilitiesF
allPossibilitiesLengths = [len(i) for i in allPossibilities]
if len(allPossibilitiesLengths) == 0:
raise Exception("no candidate ORFs")
proteinAsString = allPossibilities[allPossibilitiesLengths.index(max(allPossibilitiesLengths))]
return Seq(proteinAsString, alphabet=ProteinAlphabet)
It works perfect, but now I want to get the DNA sequence that corresponds to that sequence of proteins returned by the function. I need to add some lines to the function in order to get both sequences but I don't really know how. I dont know if it's possible to track the position of each Methionine of the i.find("M") and then use that position to track it in the nucleotide sequence.
Thanks.