Inferring mRNA from Protein Rosalind

Question

Problem

For positive integers aa and nn, aa modulo nn (written amodnamodn in shorthand) is the remainder when aa is divided by nn. For example, 29mod11=729mod11=7 because 29=11×2+729=11×2+7.

Modular arithmetic is the study of addition, subtraction, multiplication, and division with respect to the modulo operation. We say that aa and bb are congruent modulo nn if amodn=bmodnamodn=bmodn; in this case, we use the notation a≡bmodna≡bmodn.

Two useful facts in modular arithmetic are that if a≡bmodna≡bmodn and c≡dmodnc≡dmodn, then a+c≡b+dmodna+c≡b+dmodn and a×c≡b×dmodna×c≡b×dmodn. To check your understanding of these rules, you may wish to verify these relationships for a=29a=29, b=73b=73, c=10c=10, d=32d=32, and n=11n=11.

As you will see in this exercise, some Rosalind problems will ask for a (very large) integer solution modulo a smaller number to avoid the computational pitfalls that arise with storing such large numbers.

Given: A protein string of length at most 1000 aa.

Return: The total number of different RNA strings from which the protein could have been translated, modulo 1,000,000. (Don't neglect the importance of the stop codon in protein translation.)

Sample Dataset

MA

Sample Output

12

My Answer is always set to zero. I realize that there is some issue with the way use mod but I don't know what exactly that is.

rna_table = {"UUU":"F", "UUC":"F", "UUA":"L", "UUG":"L",
"UCU":"S", "UCC":"s", "UCA":"S", "UCG":"S",
"UAU":"Y", "UAC":"Y", "UAA":"STOP", "UAG":"STOP",
"UGU":"C", "UGC":"C", "UGA":"STOP", "UGG":"W",
"CUU":"L", "CUC":"L", "CUA":"L", "CUG":"L",
"CCU":"P", "CCC":"P", "CCA":"P", "CCG":"P",
"CAU":"H", "CAC":"H", "CAA":"Q", "CAG":"Q",
"CGU":"R", "CGC":"R", "CGA":"R", "CGG":"R",
"AUU":"I", "AUC":"I", "AUA":"I", "AUG":"M",
"ACU":"T", "ACC":"T", "ACA":"T", "ACG":"T",
"AAU":"N", "AAC":"N", "AAA":"K", "AAG":"K",
"AGU":"S", "AGC":"S", "AGA":"R", "AGG":"R",
"GUU":"V", "GUC":"V", "GUA":"V", "GUG":"V",
"GCU":"A", "GCC":"A", "GCA":"A", "GCG":"A",
"GAU":"D", "GAC":"D", "GAA":"E", "GAG":"E",
"GGU":"G", "GGC":"G", "GGA":"G", "GGG":"G"}

with open("rosalind_mrna.txt") as myfile:

  data = myfile.readlines()

charData = list(data[0].strip())


frequency_list = {};

for k,v in rna_table.items():

  if not frequency_list.has_key(v):
    frequency_list[v] = 1
  else:
    frequency_list[v] += 1

answer = frequency_list['STOP'];
for aa in charData:
  answer = ((answer * frequency_list[aa]) % 1000000)

print "Answer is:\n"
print answer % 1000000

how can one not neglect the importance of the stop codon in protein translation ? — PRMoureu, Sep 09 '17 at 21:47
See the original problem at http://rosalind.info/problems/mrna/ - the formatting is a bit nicer. — Hugh Bothwell, Sep 09 '17 at 21:48
How does your input file look like? Is it possible that you have an empty line at the beginning? — Maximilian Peters, Sep 09 '17 at 22:01

Hugh Bothwell · Accepted Answer · 2017-09-10T01:15:11.367

A couple of minor issues:

You have a typo: "UCC":"s" should be "UCC":"S"
if not frequency_list.has_key(v): is more Pythonically written as if v not in frequency_list: (in fact .has_key has been removed from Python 3)
charData = list(data[0].strip()) - list() is not needed, you can iterate over the string directly just as well
frequency_list = {}; - the trailing semicolon is not needed
print answer % 1000000 - the final % 1000000 is unnecessary, it was already done in the loop calculating answer

So it looks like the example, MA, works like

(
    {number of ways to encode M == 1}
  * {number of ways to encode A == 4}
  * {number of ways to encode STOP == 3}
)
% 1000000

which gives 12, as stated.

Your code works for me as-is, returning 827968. I would take a look at your copy of rosalind_mrna.txt (make sure it's not a blank file) and check the contents of data (make sure it is loading the file correctly).

Just for comparison, I would write it as

from functools import reduce
from operator import mul

freq = {
    'A': 4, 'C': 2, 'D': 2, 'E': 2,
    'F': 2, 'G': 4, 'H': 2, 'I': 3,
    'K': 2, 'L': 6, 'M': 1, 'N': 2,
    'P': 4, 'Q': 2, 'R': 6, 'S': 6,
    'T': 4, 'V': 4, 'W': 1, 'Y': 2,
    'STOP': 3
}

def load_dna_file(fname):
    with open(fname) as inf:
        dna = inf.read()
    return "".join(dna.split())   # removes all whitespace

def num_rna_strings(dna, modulo=None):
    if modulo:
        reduce_fn = lambda a, b: (a * b) % modulo
    else:
        reduce_fn = mul
    freqs = (freq[base] for base in dna)
    return reduce(reduce_fn, freqs, freq["STOP"])

def main():
    dna = load_dna_file("rosalind_mrna.txt")
    num = num_rna_strings(dna, 1000000)
    print("Answer is {}".format(num))

if __name__ == "__main__":
    main()

For interest, if you do not use modulo, the full answer is 423 digits long (6.186 * 10 ** 422)

Inferring mRNA from Protein Rosalind

1 Answers1