Python numpy efficiently combining arrays

Question

My question might sound biology heavy, but I am confident anyone could answer this without any knowledge of biology and I could really use some help.

Suppose you have a function, create_offspring(mutations, genome1, genome2), that takes a list of mutations in the form of a numpy 2D array with 10 rows and 5 columns, like this (each set of 5 values is a mutation):

    [ [4, 3, 6 , 7, 8], [5, 2, 6 , 7, 8] ...]

The function also takes two genomes, each a numpy 2D array with 10 rows and 5 columns. Each position in a genome holds either 5 zeros, where no mutation has occurred, or the values from the corresponding entry of the mutation list, where a mutation has occurred. The following is an example of a genome that has no mutation at position 0 and already has a mutation at position 1.

    [ [0, 0, 0 , 0, 0], [5, 2, 5 , 7, 8] ...]

What I am trying to accomplish is to efficiently generate a child genome from my two parent genomes (I have a current way that works, but it is WAY too slow). The child is a numpy array that is a random combination of the two parent arrays. By random combination, I mean that each position in the child array has a 50% chance of being the 5 values at that position from parent 1's genome and a 50% chance of being those from parent 2's. For example, if parent 1 is

    [ [0, 0, 0 , 0, 0], [5, 2, 6 , 7, 8] ...]

and parent 2 is

    [ [4, 3, 6 , 7, 8], [0, 0, 0 , 0, 0] ...]

the child genome should have a 50% chance of getting all zeros at position 0 and a 50% chance of getting [4, 3, 6 , 7, 8] there, and so on for the other positions.

Additionally, at each position there needs to be a .01% chance that the child genome instead gets the corresponding mutation from the mutation list passed in at the beginning.

I have a current method for solving this, but it takes far too long:

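    import numpy as np

    mutation_rate = 0.0001  # the .01% chance of a mutation at each position

    def create_offspring(mutations, genome1, genome2):
        # create an empty genome: 10 positions with 5 values each
        child_genome = np.array([[0]*5] * 10, dtype=float)
        for val in range(10):
            # one random draw in [0, 1) decides the source of this position
            random = np.random.rand()
            if random < mutation_rate:
                child_genome[val] = mutations[val]
            elif random > .5:
                child_genome[val] = genome1[val]
            else:
                child_genome[val] = genome2[val]

        return child_genome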

Are you sure that this is the part that's slow? How slow is slow? How fast do you need it to run? — Joan Smith, Mar 17 '14 at 04:26
Yes I am pretty sure. For the sake of making the question less confusing I said the array is only 10 mutations long, but in reality I am working with genomes of size >10000 and after timing everything in my code this part is slowing it down the most. It needs to be able to run the function ~ 20000000 times within a day or so. If you could help me figure out how to speed it up that would be awesome. — guitarsolos12345, Mar 17 '14 at 04:31

score 1 · Answer 1 · answered Mar 17 '14 at 04:42

Thanks for the clarification in the comments. Things work differently with 10000 than with 10 :)

First, there's a faster way to make an empty (or full) array:

    np.zeros(shape=(rows, cols), dtype=float)

Then, try generating a list of random numbers, checking each of them simultaneously, and then working from there.

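    # one random number per genome position, generated in a single call
    randoms = np.random.rand(len(genome))
    half = (randoms < .5)

    for val, (rand, is_half) in enumerate(zip(randoms, half)):
        ...  # your code, using rand and is_half for position val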

This will at least speed the random number generation. I'm still thinking on the rest.
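One possible way to vectorize the rest, offered as a sketch and not as a tested solution to the original problem: build boolean masks from the random draws and let np.where choose whole rows at once. The function name create_offspring_vectorized and the mutation_rate and rng parameters are made up for this illustration. The branch order mirrors the loop in the question (mutation first, then genome1 when the draw is above .5, otherwise genome2).

```python
import numpy as np

def create_offspring_vectorized(mutations, genome1, genome2,
                                mutation_rate=0.0001, rng=None):
    """Sketch: choose every position of the child without a Python loop."""
    # rng is a hypothetical optional argument so results can be reproduced.
    rng = np.random.default_rng() if rng is None else rng
    n_positions = genome1.shape[0]

    # One random draw per position, as in the loop version.
    randoms = rng.random(n_positions)

    # Reshape the masks to (n_positions, 1) so they broadcast
    # across the 5 values stored at each position.
    mutate = (randoms < mutation_rate)[:, None]
    from_parent1 = (randoms > 0.5)[:, None]

    # Pick rows from the parents first, then overwrite mutated positions.
    child = np.where(from_parent1, genome1, genome2)
    return np.where(mutate, mutations, child)
```

The result takes the common dtype of the three input arrays, so pass float arrays (or call .astype(float) on the result) if the child needs to be float as in the question. Whether this is fast enough for ~20000000 calls on genomes of size >10000 would need to be timed on the real data.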

Aha, awesome. Let me know if you need any more clarification, as I realize that with all the biology talk I might have missed something important. — guitarsolos12345, Mar 17 '14 at 04:46
