
This is one of those questions where I stumbled upon the right answer, but I don't understand why it's the right one, and Wikipedia didn't help. For Rosalind, I wrote a simple script that counts all the possible RNA sequences that could encode a given protein string (modulo 1,000,000). I know it's not the most efficient possible code (partly because I recycle bits from previous things I've made), but here it is:

protein = """<large protein string>"""
protein = ''.join(protein.split('\n'))

translate = {'UUU' : 'F','CUU' : 'L','AUU' : 'I','GUU' : 'V','UUC' : 'F','CUC' : 'L','AUC' : 'I','GUC' : 'V','UUA' : 'L','CUA' : 'L','AUA' : 'I','GUA' : 'V','UUG' : 'L','CUG' : 'L','AUG' : 'M','GUG' : 'V','UCU' : 'S','CCU' : 'P','ACU' : 'T','GCU' : 'A','UCC' : 'S','CCC' : 'P','ACC' : 'T','GCC' : 'A','UCA' : 'S','CCA' : 'P','ACA' : 'T','GCA' : 'A','UCG' : 'S','CCG' : 'P','ACG' : 'T','GCG' : 'A','UAU' : 'Y','CAU' : 'H','AAU' : 'N','GAU' : 'D','UAC' : 'Y','CAC' : 'H','AAC' : 'N','GAC' : 'D','UAA' : 'Stop','CAA' : 'Q','AAA' : 'K','GAA' : 'E','UAG' : 'Stop','CAG' : 'Q','AAG' : 'K','GAG' : 'E','UGU' : 'C','CGU' : 'R','AGU' : 'S','GGU' : 'G','UGC' : 'C','CGC' : 'R','AGC' : 'S',
'GGC' : 'G','UGA' : 'Stop','CGA' : 'R','AGA' : 'R','GGA' : 'G','UGG' : 'W','CGG' : 'R','AGG' : 'R','GGG' : 'G',
}
aminos = translate.values()
sample = [l for l in protein] + ['Stop']

score = []
for s in sample:
    c = aminos.count(s)
    score.append(c)

import math
result = reduce(lambda x, y: x*y, score) % 1000000
print result

This computes the total number of RNA sequences and takes the modulo of the final result (or so I think). I got the wrong answer twice before I decided to try this:

import math
result = reduce(lambda x, y: x*y % 1000000, score)
print result

This apparently produced the correct answer. Why does a modulo have to be performed at every x*y? Am I not understanding modulo or am I not understanding Python?

EDIT: Sorry, typo.

thefourtheye
  • `score % 1000000` makes no sense. `score` is a list. That shouldn't work. – user2357112 Aug 22 '13 at 21:11
  • Mathematically, these should be equivalent: `(a * b * c) % d`, `(((a * b) % d) * c) % d` – Eric Aug 22 '13 at 21:25
  • This makes more sense, but aside from perhaps running a lot faster and not crashing with an `OutOfMemoryError`, I don't see why it'd produce different results from the first version. – user2357112 Aug 22 '13 at 21:28
  • And yet the two versions of code I presented (post-edit) give me different answers. – thefourtheye Aug 22 '13 at 21:30
  • Could you reduce the large protein string to a smaller protein string that shows the same problem and paste it here, so we can reproduce it and debug? –  Aug 22 '13 at 21:38
  • On an unrelated note, you can cut down your memory footprint with `score = (aminos.count(s) for s in sample)` – Eric Aug 22 '13 at 21:45
  • also what exact version of python, what 'bitiness' and on what hardware. It feels like you might be blowing a 32 bit int, but then this is python.... Would also be potentially useful to grab the output of the reduce in the buggy version and print that before the modulo. – tolanj Aug 22 '13 at 21:51
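
The identity from Eric's comment can be checked directly on a small input. The following is only a sketch: the helper names `product_then_mod` and `mod_each_step` and the sample `scores` list are made up for illustration and are not part of the original script. The trailing `% mod` in the second helper is an addition so that a single-element list is also reduced.

```python
from functools import reduce  # builtin in Python 2; this import works on 2.6+ and 3
import random

MOD = 1000000


def product_then_mod(values, mod=MOD):
    # Multiply everything first, then take the remainder once at the end.
    return reduce(lambda x, y: x * y, values) % mod


def mod_each_step(values, mod=MOD):
    # Take the remainder after every multiplication.
    # The final "% mod" only matters for a one-element list (an added safeguard).
    return reduce(lambda x, y: x * y % mod, values) % mod


# Hypothetical codon counts: each amino acid is encoded by 1, 2, 3, 4 or 6 codons.
random.seed(0)
scores = [random.choice([1, 2, 3, 4, 6]) for _ in range(1000)]

# With exact integer arithmetic the two approaches agree.
assert product_then_mod(scores) == mod_each_step(scores)
```

If the two versions disagree on a real input, that points to something other than the arithmetic itself, such as a difference in the data or in the code that was actually run.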

1 Answer


The difference between

reduce(lambda x, y: x*y, score) % 1000000

and

reduce(lambda x, y: x*y % 1000000, score)

is that the first has to work with longs as large as the product of all the values in score, whereas the second never handles a value larger than max(score) * 999999.

Arbitrarily large integers cannot be stored in finite memory, nor can their product be calculated in constant time, so you're far more likely to hit a MemoryError or take a very long time with the first option.
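
To make the size difference concrete, here is a rough sketch that measures both approaches. The function name `compare_sizes` and the sample input are hypothetical, chosen only for illustration; the real `score` list would come from the protein string.

```python
from functools import reduce  # builtin in Python 2; this import works on 2.6+ and 3

MOD = 1000000


def compare_sizes(scores, mod=MOD):
    # Approach 1: build the full product, then reduce once.
    full_product = reduce(lambda x, y: x * y, scores)

    # Approach 2: reduce after every step, tracking the largest
    # value that exists before the modulo is applied.
    acc = 1
    largest_intermediate = 0
    for s in scores:
        intermediate = acc * s
        largest_intermediate = max(largest_intermediate, intermediate)
        acc = intermediate % mod

    return full_product.bit_length(), largest_intermediate, full_product % mod, acc


# Hypothetical codon counts for a 1000-residue protein.
full_bits, largest, result_full, result_step = compare_sizes([6, 4, 2, 3] * 250)

assert result_full == result_step            # same answer either way
assert largest <= 6 * (MOD - 1)              # bounded by max(score) * 999999
assert full_bits > 1000                      # the full product needs over 1000 bits
```

The stepwise version keeps every intermediate value small, while the full product grows with the length of the protein.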

Eric
  • You know, that's probably it. I checked the result without modulo and the int was crazy big. I suspected it was a programming and not a math question after all. – thefourtheye Aug 22 '13 at 21:55
  • @thefourtheye: Still, it shouldn't cause loss of accuracy - python doesn't have integer overflow – Eric Aug 23 '13 at 11:08