Generating DNA sequence excluding specific sequence

Question

I've just started to learn programming with python. In class we were asked to generate a random DNA sequence, that does NOT contain a specific 6-letter sequence (AACGTT). The point is to make a funtion that always return a legal sequence. Currently my function generates a correct sequence about 78% of the time. How can I make it return a legal sqeuence 100% of the time? Any help is appreciated.

Here is what my code looks like for now:

from random import choice
def generate_seq(length, enzyme):
    list_dna = []
    nucleotides = ["A", "C", "T", "G"]
    i = 0
    while i < 1000:
        nucleotide = choice(nucleotides)
        list_dna.append(nucleotide)
        i = i + 1

    dna = ''.join(str(nucleotide) for nucleotide in list_dna)
    return(dna) 


seq = generate_seq(1000, "AACGTT")
if len(seq) == 1000 and seq.count("AACGTT") == 0:
    print(seq)

In your 22% of cases which are not legal, what makes the data not legal? — JacobIRR, Nov 12 '18 at 18:15
If the only goal is to have a DNA sequence without that 6-letter sequence, just change `nucleotides = ["A"]` and be technically correct (the best kind of correct). You could also just remove that sequence if you happen to generate it. Or you could constrain in a different way. It depends on what you want to do with this sequence later. — CJR, Nov 12 '18 at 18:15
Your DNA sequence can legally contain AACGT, however to keep it legal the next base chosen should be picked from [A, C, G] only since T is not allowed. — rossum, Nov 12 '18 at 18:17

score 1 · Accepted Answer · answered Nov 12 '18 at 18:26

One option is to check your last few entries in your loop and only keep appending if the 'bad' sequence hasn't been created. However, this may result in a higher than true-random chance of having the "AACGT" sequence, just with a different letter instead of the last "T"

from random import choice
def generate_seq(length, enzyme):
    list_dna = []
    nucleotides = ["A", "C", "T", "G"]
    i = 0
    while i < 1000:
        nucleotide = choice(nucleotides)
        list_dna.append(nucleotide)
        #check for invalid sequence. If found, remove last element and redraw
        if ''.join(list_dna[-6:]) == "AACGTT":
            list_dna.pop()
        else:
            i = i + 1

    dna = ''.join(str(nucleotide) for nucleotide in list_dna)
    return(dna) 


seq = generate_seq(1000, "AACGTT")
if len(seq) == 1000 and seq.count("AACGTT") == 0:
    print(seq)

score 1 · Answer 2 · answered Nov 12 '18 at 18:34

One idea is to check if the previous 5 nucleotides were equals to AACGT, in that case pick only from ["A", "C", "G"].

from random import choice


def generate_seq(length, enzyme, bad_prefix="AACGT"):
    list_dna = []
    nucleotides = ["A", "C", "T", "G"]
    i = 0
    while i < 1000:
        if list_dna[-5:] != bad_prefix:
            nucleotide = choice(nucleotides)
        else:
            nucleotide = choice(["A", "C", "G"])
        list_dna.append(nucleotide)
        i = i + 1

    dna = ''.join(str(nucleotide) for nucleotide in list_dna)
    return dna


seq = generate_seq(1000, "AACGTT")
if len(seq) == 1000 and seq.count("AACGTT") == 0:
    print(seq)

Generating DNA sequence excluding specific sequence

2 Answers2