I have a population of N chromosomes that can all be represented by binary strings of length L. N is typically around 1e4 (plus or minus two orders of magnitude). L varies a lot but can go up to 1e7. At the moment, I store all this information using L bits for each of the N chromosomes, but this consumes too much memory. I am looking for a better data structure.
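To put a number on "too much memory", here is a rough back-of-the-envelope sketch. The helper name `dense_bytes` is made up for illustration, and it assumes the bits are packed eight to a byte with no other overhead.

```python
def dense_bytes(n_chromosomes: int, length_bits: int) -> int:
    """Bytes needed to store every chromosome as a packed bit string."""
    bytes_per_chromosome = (length_bits + 7) // 8  # round up to whole bytes
    return n_chromosomes * bytes_per_chromosome


# Typical N and the largest L mentioned above.
print(dense_bytes(10**4, 10**7) / 1e9, "GB")  # prints "12.5 GB"
```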
We can take advantage of the fact that the N chromosomes are not randomly dispersed in the 2^L space of possibilities. They tend to be tightly clustered. In other words, the average Hamming distance is typically much smaller than L/2. If we imagine doing a PCA of our chromosomes, it might look like this:
Over discrete time steps, all N chromosomes are replaced by N other chromosomes. There are "mutations" (one mutation flips one bit). Hence, the population of chromosomes evolves so that later on it might look more like this, for example:
Now, the problem is that it is possible for the N chromosomes to form two (or more) groups.
Plus, some "offspring" chromosomes may be created as "hybrids" of two "parent" chromosomes. If the two parents belong to two different groups (groups as seen on the PCA graph), things get a little harder. For example, assuming L=16, we may have the two parent chromosomes
[0,0,0,1,0,0,0,0,1,0,0,0,1,0,1,0] // Let's call it 'Alice'
[1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1] // Let's call it 'Bob'
"giving birth" to chromosome
[0,0,0,1,0,0,0,0,1,0,0,1,1,1,1,1] // This is Charlie, son of a mixture of Alice and Bob
//           Alice <- ^ -> Bob
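The hybridization in this example can be sketched as a single-point crossover. This is only an illustration: the function name `crossover` and the assumption that there is exactly one crossover point (index 11 here, read off the example strings) are mine.

```python
def crossover(first, second, point):
    """Single-point crossover: bits [0, point) from `first`, the rest from `second`."""
    if len(first) != len(second):
        raise ValueError("parents must have the same length")
    return first[:point] + second[point:]


alice = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0]
bob = [1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

# Assumed crossover point: Charlie copies Alice up to index 10, then Bob.
charlie = crossover(alice, bob, 11)
print(charlie)
```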
In practice we can have lots of these different hybrids, and hybrids can further hybridize. Still, the explored part of the parameter space is tiny compared to the 2^L possibilities, so there should be a way to avoid using N*L bits to represent the N chromosomes.
What data structure could I use to represent the N chromosomes so as to minimize memory usage?
I was thinking I could keep a few reference chromosomes and describe every other chromosome by the positions at which it differs from a reference chromosome.
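A minimal sketch of this reference-plus-differences idea, assuming a single reference and chromosomes held as Python lists of 0/1. The names `encode`, `decode` and `crossover_diffs` are hypothetical, and the single crossover point is an assumption carried over from the Alice/Bob example.

```python
def encode(chromosome, reference):
    """Return the sorted positions where `chromosome` differs from `reference`."""
    return [i for i, (c, r) in enumerate(zip(chromosome, reference)) if c != r]


def decode(diff_positions, reference):
    """Rebuild the full chromosome by flipping the listed positions."""
    out = list(reference)
    for i in diff_positions:
        out[i] ^= 1
    return out


def crossover_diffs(first_diffs, second_diffs, point):
    """Hybrid of two chromosomes stored as diffs against the SAME reference."""
    return [i for i in first_diffs if i < point] + [i for i in second_diffs if i >= point]


alice = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0]
bob = [1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

reference = alice                   # assumption: Alice serves as the reference
alice_diffs = encode(alice, reference)
bob_diffs = encode(bob, reference)
charlie_diffs = crossover_diffs(alice_diffs, bob_diffs, 11)
charlie = decode(charlie_diffs, reference)
```

Each stored position costs roughly log2(L) bits (in practice a 32-bit integer for L up to 1e7), so this only saves memory while the number of differences times the bits per index stays below L. A hybrid can be built directly on the diff lists only when both parents are encoded against the same reference.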
I could also place all the chromosomes at the leaves of a B+ tree, where each branch lists the differences it carries relative to the reference chromosomes. The root of the B+ tree could be [0,0,0,0,...0,0], for example. Every few time steps, I could recompute the entire B+ tree to clean things up. However, hybrids will be a problem with this solution. Also, I am wondering whether a B+ tree should allow reverse mutations. Maybe I should allow some branches of the B+ tree to cover only a subpart of the chromosome (like the first L/5 bits). In any case, I am unsure how all of this would be implemented (for example, how the B+ tree would be recomputed to clean it up).
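To make the tree idea concrete, here is a rough sketch. It uses a plain parent-pointer tree rather than an actual B+ tree, with an all-zero root; the `Node` class and helper names are invented for illustration. Each node stores only the positions flipped relative to its parent, and combining them with a symmetric difference means a reverse mutation simply cancels the earlier flip.

```python
class Node:
    """A tree node storing only the positions flipped relative to its parent."""

    def __init__(self, parent=None, flipped=()):
        self.parent = parent
        self.flipped = frozenset(flipped)


def diff_from_root(node):
    """Positions where this node differs from the (all-zero) root chromosome."""
    diff = set()
    while node is not None:
        diff ^= node.flipped  # symmetric difference: a flip applied twice cancels
        node = node.parent
    return diff


def materialize(node, length):
    """Expand a node into a full list of bits."""
    diff = diff_from_root(node)
    return [1 if i in diff else 0 for i in range(length)]


root = Node()                       # stands for [0,0,...,0]
parent = Node(root, {3, 8})         # two mutations away from the root
child = Node(parent, {8, 12})       # position 8 mutates back, 12 is new
bits = materialize(child, 16)
```

This sketch does not address hybrids, which have two parents and so do not fit a single-parent tree, nor the periodic cleanup of the tree.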