I have a fixed 32 bits in which to store as much DNA as possible. The amount of space required to store 1 character of DNA ('A', 'C', 'G' or 'T') is 2 bits (00, 01, 10, 11, as there are only 4 combinations).
To store up to 2 characters (so A, C, G, T, AA, AC, ..., TT), there are 20 possible combinations, which we can work out with the function ((4**(x+1))-4)/(4-1), where x is the maximum length of DNA we want to store. 16 characters of DNA would therefore have 5,726,623,060 combinations; however, in 32 bits I can only store 4,294,967,296 distinct numbers (2**32).
So long story short, in 32 bits the maximum amount of variable-length DNA one can store is 15 letters (1,431,655,764 combinations).
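As a sanity check on those counts, here is a minimal sketch that sums the geometric series 4 + 4**2 + ... + 4**x using its closed form (4**(x+1) - 4) / 3. The function names `count_sequences` and `max_length_for_bits` are just illustrative names of mine, not from any library.

```python
def count_sequences(max_len: int) -> int:
    """Number of DNA strings of length 1..max_len over the alphabet A, C, G, T."""
    # Closed form of the geometric series 4 + 4**2 + ... + 4**max_len.
    return (4 ** (max_len + 1) - 4) // (4 - 1)


def max_length_for_bits(bits: int) -> int:
    """Largest max_len for which every string of length 1..max_len gets its own number."""
    max_len = 0
    # Keep growing while the next length still fits in 2**bits distinct values.
    while count_sequences(max_len + 1) <= 2 ** bits:
        max_len += 1
    return max_len


print(count_sequences(2))       # 20
print(count_sequences(15))      # 1431655764
print(count_sequences(16))      # 5726623060
print(max_length_for_bits(32))  # 15
```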
So, the next step is to make a function which takes up to 15 letters of DNA as a string and turns it into a number. It doesn't matter which number ('A' could be 0, it could be 1, it could be 1332904, it really doesn't matter) so long as we can reverse the function and get the number back to 'A' later.
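To make the requirement concrete, here is a rough sketch of the kind of reversible mapping I have in mind, not necessarily the best one. It assumes one particular numbering: all strings shorter than the input come first (the offset), then the string is read as a base-4 number with A=0, C=1, G=2, T=3. The names `encode`, `decode`, `ALPHABET` and `MAX_LEN` are hypothetical.

```python
ALPHABET = "ACGT"  # assumed digit order: A=0, C=1, G=2, T=3
MAX_LEN = 15       # longest string whose numbering still fits in 32 bits


def encode(dna: str) -> int:
    """Map a DNA string of length 1..MAX_LEN to a unique non-negative integer."""
    if not 1 <= len(dna) <= MAX_LEN:
        raise ValueError("DNA length must be between 1 and 15")
    # Offset = number of strings strictly shorter than this one.
    offset = (4 ** len(dna) - 4) // 3
    value = 0
    for ch in dna:
        # str.index raises ValueError for anything other than A, C, G, T.
        value = value * 4 + ALPHABET.index(ch)
    return offset + value


def decode(number: int) -> str:
    """Inverse of encode: recover the DNA string from its integer."""
    if not 0 <= number < (4 ** (MAX_LEN + 1) - 4) // 3:
        raise ValueError("number out of range")
    # Peel off the blocks of shorter strings to find the length.
    length = 1
    while number >= 4 ** length:
        number -= 4 ** length
        length += 1
    # What remains is the string read as a base-4 number with `length` digits.
    letters = []
    for _ in range(length):
        letters.append(ALPHABET[number % 4])
        number //= 4
    return "".join(reversed(letters))


print(encode("A"))                # 0
print(encode("TT"))               # 19
print(decode(encode("GATTACA")))  # GATTACA
```

Under this numbering the largest value, for fifteen T's, is 1,431,655,763, which is below 2**32, and nothing needs to be stored in memory beyond the string itself.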
I started to solve this by making a dictionary of key, value pairs containing 1,431,655,764 elements, but quickly ran out of RAM. This is why I need a translation function from string to int.