-1

I have a fixed 32 bits in which to store as much DNA as possible. The amount of space required to store 1 character of DNA ('A', 'C', 'G' or 'T') is 2 bits (00, 01, 10, 11, as there are only 4 combinations).

To store up to 2 characters, (so, A, C, G, T, AA, AC, ..., GG) there are 20 possible combinations, which we can work out with the function ((4**(x+1))-2)/(4-1), where x is the maximum length of DNA we want to store. 16 characters of DNA would therefore have 5,726,623,060 combinations, however in 32 bits I can only store up to 4,294,967,296 numbers (2**32).

So long story short, in 32 bits the maximum amount of variable-length DNA one can store is 15 letters (1,431,655,764 combinations).

So, next step is to make a function which takes up to 15 letters of DNA as a string, and turns it into a number. It doesn't matter which number ('A' could be 0, it could be 1, it could be 1332904, it really doesn't matter) so long as we can reverse the function and get the number back to 'A' later.

I started to solve this by making a dictionary of key, value pairs containing 1,431,655,764 elements, but quickly ran out of RAM. This is why I need a translation function from string to int.

jonrsharpe
  • 115,751
  • 26
  • 228
  • 437
J.J
  • 3,459
  • 1
  • 29
  • 35
  • 3
    Once having established that you can store each character in two bits, I am not sure what part you are having difficulty with. Do you have any code? – khelwood Oct 14 '15 at 14:25
  • 1
    2 bits per letter only works when the length of the string is constant. For example, in 4 bits I can represent 'AA' with 0000, but i cannot represent 'A' on its own. There are 20 combinations of DNA, and only 16 values can be stored in 4 bits. – J.J Oct 14 '15 at 14:27
  • 4
    In that case you have extra information (i.e. the length of the string) which will require extra space to be encoded. – khelwood Oct 14 '15 at 14:32
  • 5
    "Why would you down vote and a submit a close vote for a post with no previous solution and no reason why!?" because your question boils down to "give me the codez" – Adam Smith Oct 14 '15 at 14:34
  • @khelwood - I thought i mentioned that when i wrote "UP TO" in all caps, and then provided a list of example strings... AdamSmith - dont all questions tagged under a programming language boil down to that? I'll take a hint of a solution if you have one! Also i said that i tried making a lookup table in a dictionary, and other than that i literally have no idea where to even start, which is why i came here... – J.J Oct 14 '15 at 14:35
  • @J.J khelwood already gave you that. encode the length of the string followed by your bytes of ACGT. – Adam Smith Oct 14 '15 at 14:37
  • 3
    Mostly this is a silly proposition because you're trying to use Python in a very memory-limited context. That's a bad idea for all the same reasons that you don't use watercolors for paint by numbers. It gets sloppy. Use a language that doesn't do its own memory management – Adam Smith Oct 14 '15 at 14:40
  • I would need 4 bits to encode the length, meaning there is only have 28 bits to encode the DNA. 28 bits can hold up to 268435456, or 13 letters of DNA. I would rather do it another way and store all 15 characters. – J.J Oct 14 '15 at 14:42
  • Note that the `""` takes 25 bytes in Python and `0` takes 12 bytes. It's not known for having the smallest memory footprint :) – Adam Smith Oct 14 '15 at 14:42
  • Python has numpy, and that is just as efficient as anything else at storing 32 bit unsigned ints. Im not sure where all this hostility towards a fairly simply question is coming from, haha, but theres really no need to bring Python into this ;) – J.J Oct 14 '15 at 14:43
  • @DSM yeah in retrospect I agree. Nominated for reopening – Adam Smith Oct 14 '15 at 15:26
  • Are you're representing a single sequence with *multiple* of these 32-bit representations with 2bits reserved? If so, note that for sequences of more than a few hundred bases it uses less memory to store the length separately, and just stuff 16 bases in each 32-bit area. – Andy Thomas Oct 15 '15 at 17:17

3 Answers3

4

Here is my suggestion.

If storing your letters takes 2 to 30 bits, you have at least 2 bits left to help you to deduce the length. Always add a 1 after the bits representing your characters, and pad the rest with zeroes. That way, if you look for the last 1 in your bit pattern, it will always be just after the end of your string.

E.g.

A  -> 00 (meaning 'A') followed by 1 (meaning 'end') followed by 29 zeroes
   -> 00100000000000000000000000000000
C  -> 01 (meaning 'C') followed by 1 (meaning 'end') followed by 29 zeroes
   -> 01100000000000000000000000000000
G  -> 10 (meaning 'G') followed by 1 (meaning 'end') followed by 29 zeroes
   -> 10100000000000000000000000000000
T  -> 11 (meaning 'T') followed by 1 (meaning 'end') followed by 29 zeroes
   -> 11100000000000000000000000000000
AA -> 0000 (meaning 'AA') followed by 1, followed by 27 zeroes
   -> 00001000000000000000000000000000
...
AAAAAAAAAAAAAAA
   -> 000000000000000000000000000000 (meaning the 'A's) followed by 1, followed by 1 zero
   -> 00000000000000000000000000000010

That ought to unambiguously represent your string and allow you up to 15 characters in 32 bits.

khelwood
  • 55,782
  • 14
  • 81
  • 108
2

Given a static ordering, you can refer to each individual permutation as a single number representing its order in the sequence. Then rather than building the dictionary you can convert it on the other side.

import itertools

def build_sequence(length):
    return itertools.product("ACTG", repeat=length)

def encode(sequence: str):
    seq = build_sequence(len(str))
    t_seq = list(sequence)  # to compare to seq
    for idx, s in enumerate(seq):
        if s == t_seq:
            return idx

def decode(e_sequence):
    max_length = math.ceil(math.log(e_sequence, 4))
    # max_length is the character count of ATCGs that e_sequence contains,
    #   since each sequence has 4**length elements
    seq = build_sequence(max_length)
    for _ in range(e_sequence):
        # skipping these is like indexing a list
        next(seq)
    return next(seq)

You can then pack that number into a smaller type and send it over the wire, unpack it again, and decode it.

import struct

packed = struct.pack("I", encode(some_sequence))
#  send it? I'm not sure what you're doing with it

rcvd_pack = b'SUUU'
unpacked = struct.unpack("I", rcvd_pack)
# becomes a tuple of the value
enc_seq = unpacked[0]

result = decode(enc_seq)

This should let you build 16 character sequences and pack them in 32 bits of data.

Adam Smith
  • 52,157
  • 12
  • 73
  • 112
  • 1
    Thank you for this Adam, i learnt a lot from figuring out how your code works :) Im putting it into a numpy array, but yeah id use struct exactly as you described. The only thing is that as you walk through the itertools.product, to encode 15 G's takes a very long time. But it does work, and is elegant. the math.ceil(math.log(seq,4)) was the secret sauce for me :) – J.J Oct 14 '15 at 16:20
  • @J.J agreed, it does. You should be able to math your way there, but I don't have the time to figure out how the cases work. It'll be something similar to your binary encoding, but in base 4 instead of base 2 (I think? I'm fuzzy on the math) – Adam Smith Oct 15 '15 at 14:52
1

Using Khelwood's info, I solved it with the following code:

b = {'A':0b00,'C':0b01,'G':0b10,'T':0b11}
t = {'00':'A','01':'C','10':'G','11':'T'}

def binarize(string):
    result = 0
    for char in (string + 'G').ljust(16,'A'): result = (result << 2) + d[char]
    return result

def textualize(value):
    result = ''
    for twobits in [ format(value, '032b')[i:i+2] for i in range(0,32,2) ]:
        result += t[twobits]
    return result.rstrip('A')[:-1]

>>> binarize('TTTTTTTTTTTTTTT')
    4294967294

>>> textualize(4294967294)
    u'TTTTTTTTTTTTTTT'

I'm sure there is a more efficient way of doing all this, which I think I will need to do as i'll be encoding and decoding billions of times, but for now it at least works :)

J.J
  • 3,459
  • 1
  • 29
  • 35