How to track the position of a start codon (ATG) in a nucleotide sequence after using the translate function of Biopython?

Question

I have a FASTA file with a bunch of sequences with the following format:

>BMRat|XM_008846946.1
ATGAAGAACATCACAGAAGCCACCACCTTCATTCTCAAGGGACTCACAGACAATGTGGAACTACAGGTCA
TCCTCTTTTTTCTCTTTCTAGCGATTTATCTCTTCACTCTCATAGGAAATTTAGGACTTATTATTTTAGT
TATTGGGGATTCAAAACTCCACAACCCTATGTACTGTTTTCTGAGTGTATTGTCTTCTGTAGATGCCTGC
TATTCCTCAGACATCACCCCGAATATGTTAGTAGGCTTCCTGTCAAAAAACAAAGGCATTTCTCTCCATG
GATGTGCAACACAGTTGTTTCTCGCTGTTACTTTTGGAACCACAGAATGCTTTCTGTTGGCGGCAATGGC
TTATGACCGCTATGTAGCCATCCATGACCCACTTCTCTATGCAGTGAGCATGTCACCAAGGATCTATGTG
CCGCTCATCATTGCTTCCTATGCTGGTGGAATTCTGCATGCGATTATCCACACCGTGGCCACCTTCAGCC
TGTCCTTCTGTGGATCTAATGAAATCAGTCATATATTCTGTGACATCCCTCCTCTGCTGGCTATTTCTTG
TTCTGACACTTACATCAATGAGCTCCTGTTGTTCTTCTTTGTGAGCTCCATAGAAATAGTCACTATCCTC
ATCATCCTGGTCTCTTATGGTTTCATCCTTATGGCCATTCTGAAGATGAATTCAGCTGAAGGGAGGAGAA
AAGTCTTCTCTGCATGTGGGTCTCACCTAACTGGAGTGTCCATTTTCTATGGGACAAGCCTTTTCATGTA
TGTGAGACCAAGCTCCAACTATTCCTTGGCACATGACATGGTAGTGTCGACATTTTATACCATTGTGATT
CCCATGCTGAACCCTGTCATCTACAGTCTGAGGAACAAAGATGTGAAAGAGGCAATGAGAAGATTTTTGA
AGAAAAATTTTCAGAAACTTTAA

The code below, written with Biopython (http://biopython.org/wiki/Seq), finds, for each sequence in the FASTA file, the longest amino acid sequence that starts with methionine and ends with a stop codon.

The function is find_largest_polypeptide_in_DNA. It translates the DNA sequence into an amino acid sequence in each of the 3 forward reading frames, and saves in the variable allPossibilities the segments that start with M (methionine) and end in a stop codon. It then compares the lengths of these candidates and returns the protein sequence of the longest one.

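# Imports assume Biopython < 1.78; Bio.Alphabet was removed in 1.78.
from Bio.Seq import Seq
from Bio.Alphabet import ProteinAlphabet

def find_largest_polypeptide_in_DNA(seq, translationTable=1):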
    allPossibilities = []
    for frame in range(3):
        trans = str(seq[frame:].translate(translationTable))
        framePossibilitiesF = [i[i.find("M"):] for i in trans.split("*") if "M" in i]
        allPossibilities += framePossibilitiesF
    allPossibilitiesLengths = [len(i) for i in allPossibilities]

    if len(allPossibilitiesLengths) == 0:
        raise Exception("no candidate ORFs")

    proteinAsString = allPossibilities[allPossibilitiesLengths.index(max(allPossibilitiesLengths))]

    return Seq(proteinAsString, alphabet=ProteinAlphabet)

It works perfectly, but now I want to get the DNA sequence that corresponds to the protein sequence returned by the function. I need to add some lines to the function so that it returns both sequences, but I don't really know how. I don't know if it's possible to record the position of each methionine found by i.find("M") and then use that position to locate the start codon in the nucleotide sequence.

Thanks.

You want to modify the function so that it returns the DNA sequence rather than the amino acid sequence of the longest segment that starts with Met and ends with a stop codon? — Bennett Brown, Apr 03 '18 at 02:41
Is there a reason you want to ignore the 3 reading frames going the other way? You know which strand your gene is transcribed from? — Bennett Brown, Apr 03 '18 at 02:54
Your question indicates the desired sequence must include a stop codon. The code you provide also includes the segment at the end of each sequence in the FASTA file that begins with M and is unterminated. Do you want to include or exclude segments at the end of the FASTA sequence that start with Met but are unterminated? — Bennett Brown, Apr 03 '18 at 03:15
Yes, I should read it in the reverse direction too, thanks. @BennettBrown — Catalina Ardila Suarez, Apr 03 '18 at 03:26
I want the longest segment that begins with an M and ends with a stop codon @BennettBrown — Catalina Ardila Suarez, Apr 03 '18 at 03:28
Relevant script https://github.com/chris-rands/CR_bioinformatics_utilities/blob/master/scripts/faTranslateBioPython.py — Chris_Rands, Apr 03 '18 at 15:57

Bennett Brown · Accepted Answer · 2018-04-03T05:13:41.903

I think it would be easiest to write a new function following similar principles. Your idea "to track the position of each Methionine of the i.find('M')" is basically what's done below. The difficulty in doing this with the code you're starting with is that the sequences get chopped up by the split('*'), so the DNA starting position is the reading frame offset plus three bases for every codon in the segments that precede the sequence of concern. Per your clarification, I added an enclosing loop to iterate across forward and backward directions.

def find_largest_polypeptide_in_DNA(seq, translationTable=1):
    # Set the record to start with, then try to beat it
    longest_DNA = ''
    longest_amino_acid_sequence = 0

    # Scan the given strand, then its reverse complement (seq must be a Bio.Seq.Seq).
    for strand_DNA in [seq, seq.reverse_complement()]:
        # Check all three reading frames on this strand.
        for frame in range(3):
            trans = str(strand_DNA[frame:].translate(translationTable))
            # Codons already consumed from the start of this frame
            cut_codons = 0
            while 'M' in trans:
                codons_before_Met = trans.find('M')
                cut_codons += codons_before_Met
                trans = trans[codons_before_Met:]
                if '*' in trans:
                    # Length in codons, counting the stop codon
                    length = trans.find('*') + 1
                    if length > longest_amino_acid_sequence:
                        longest_amino_acid_sequence = length
                        first_bp = frame + 3*cut_codons
                        last_bp = first_bp + 3*length  # exclusive slice end
                        longest_DNA = str(strand_DNA[first_bp:last_bp])
                    # Skip past this ORF and keep the codon count in sync
                    cut_codons += length
                    trans = trans[length:]
                else:
                    # Ignore sequence M... if ORF extends beyond FASTA?
                    trans = ''
    return longest_DNA
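
If you want to sanity-check the index arithmetic without Biopython, here is a minimal sketch in plain Python. The helper name orf_slices and the toy sequence are made up for illustration, and it assumes the standard genetic code (table 1) and forward frames only. It shows that a Met at codon index p in reading frame f starts at DNA index f + 3*p, and that the slice end is the stop codon's start plus 3.

```python
# Hypothetical helper for illustration; not part of Biopython.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code (table 1)

def orf_slices(dna, frame):
    """Return (start, end) slice bounds of each ATG...stop ORF in one forward frame."""
    found = []
    start = None
    for pos in range(frame, len(dna) - 2, 3):
        codon = dna[pos:pos + 3]
        if start is None and codon == "ATG":
            start = pos                      # = frame + 3 * (codon index of the Met)
        elif start is not None and codon in STOP_CODONS:
            found.append((start, pos + 3))   # exclusive end, stop codon included
            start = None
    return found

dna = "CCATGAAATAGGG"  # toy sequence, not from the FASTA file
for frame in range(3):
    for start, end in orf_slices(dna, frame):
        print(frame, start, end, dna[start:end])
# prints: 2 2 11 ATGAAATAG
```

The only ORF in the toy sequence sits in frame 2, and dna[2:11] has length 9, a whole number of codons. A slice that came back with a length not divisible by 3 would point to an off-by-one in the end index.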
