What data structure to represent clustered dots within a Hamming space?

Question

I have a population of N chromosomes that can each be represented by a binary string of length L. N is typically around 1e4 (plus or minus two orders of magnitude). L can vary a lot but can go up to 1e7. For the moment, I store all this information using L bits for each of the N chromosomes, but this consumes too much memory. I am looking for a better data structure.

We can take advantage of the fact that the N chromosomes are not randomly dispersed in the 2^L space of possibilities. They tend to be very clustered. In other words, the average Hamming distance between two chromosomes is typically much smaller than L/2. If we imagine doing a PCA of our chromosomes, it might look like this
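To check how clustered a population actually is, the average Hamming distance can be measured directly. This is only a sketch: it assumes each chromosome is packed 64 bits per word, and the names `Packed`, `hammingDistance` and `meanPairwiseDistance` are made up for illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical alias: one chromosome packed 64 bits per word.
using Packed = std::vector<std::uint64_t>;

// Number of bit positions at which a and b differ.
std::size_t hammingDistance(const Packed& a, const Packed& b)
{
    assert(a.size() == b.size());
    std::size_t d = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += std::bitset<64>(a[i] ^ b[i]).count(); // popcount of the XOR
    return d;
}

// Mean Hamming distance over all unordered pairs of the population.
double meanPairwiseDistance(const std::vector<Packed>& pop)
{
    std::size_t pairs = 0, total = 0;
    for (std::size_t i = 0; i < pop.size(); ++i)
        for (std::size_t j = i + 1; j < pop.size(); ++j)
        {
            total += hammingDistance(pop[i], pop[j]);
            ++pairs;
        }
    return pairs ? static_cast<double>(total) / static_cast<double>(pairs) : 0.0;
}
```

The all-pairs loop is quadratic in N, so for large populations it is probably more practical to average over a random sample of pairs.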

Over discrete time steps, all N chromosomes are replaced by N other chromosomes. There are "mutations" (one mutation changes one bit at a time). Hence, the population of chromosomes evolves so that, later on, it might look more like this, for example

Now, the problem is that the N chromosomes may well form two (or more) groups

Plus, it is possible that some "offspring" chromosomes are created as "hybrids" of two "parent" chromosomes. If the two parent chromosomes belong to two different groups (groups as seen on the PCA graph), then it makes things a little harder. For example, assuming L=16, we may have the two parent chromosomes

[0,0,0,1,0,0,0,0,1,0,0,0,1,0,1,0] // Let's call it 'Alice'
[1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1] // Let's call it 'Bob'

"giving birth" to chromosome

[0,0,0,1,0,0,0,0,1,0,0,1,1,1,1,1] // This is Charlie, son of a mixture of Alice and Bob 
//           Alice <- ^ -> Bob

In practice we can have lots of these different hybrids, and hybrids can further hybridize. Still, the part of the space that is actually explored is tiny compared to the 2^L possibilities, so there should be a way to avoid using N*L bits to represent the N chromosomes.
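The hybrid example above behaves like a single-point crossover. A minimal sketch, assuming that is the only kind of hybridization (the function name `crossover` and the `Bits` alias are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical alias: one chromosome as an unpacked list of 0/1 values.
using Bits = std::vector<int>;

// The child takes positions [0, cut) from `first` and [cut, L) from `second`.
Bits crossover(const Bits& first, const Bits& second, std::size_t cut)
{
    assert(first.size() == second.size() && cut <= first.size());
    const auto offset = static_cast<std::ptrdiff_t>(cut);
    Bits child(first.begin(), first.begin() + offset);
    child.insert(child.end(), second.begin() + offset, second.end());
    return child;
}
```

With the 16-element vectors listed above, `crossover(alice, bob, 11)` gives Charlie: positions 0 to 10 come from Alice and positions 11 to 15 from Bob.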

What data structure could I use to represent the N chromosomes so as to minimize memory usage?

I was thinking I could have a few reference chromosomes and describe every other chromosome by the positions at which it differs from one of the reference chromosomes.
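To make the reference-chromosome idea concrete, here is a rough sketch. It assumes each chromosome stores the index of one reference plus a sorted list of the positions where it differs; the names `DiffChromosome`, `encode`, `decode` and `mutate` are invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bits = std::vector<bool>;

struct DiffChromosome
{
    std::size_t reference;              // index into the table of references
    std::vector<std::uint32_t> flipped; // sorted positions that differ from it
};

// Describe `full` as a list of differences from refs[ref].
DiffChromosome encode(const std::vector<Bits>& refs, std::size_t ref, const Bits& full)
{
    assert(full.size() == refs[ref].size());
    DiffChromosome d{ref, {}};
    for (std::size_t i = 0; i < full.size(); ++i)
        if (full[i] != refs[ref][i])
            d.flipped.push_back(static_cast<std::uint32_t>(i));
    return d;
}

// Rebuild the full bit string from the reference and the differences.
Bits decode(const std::vector<Bits>& refs, const DiffChromosome& d)
{
    Bits out = refs[d.reference];
    for (std::uint32_t pos : d.flipped)
        out[pos] = !out[pos];
    return out;
}

// A mutation toggles one position: it either adds a difference or,
// for a reversal mutation, removes an existing one.
void mutate(DiffChromosome& d, std::uint32_t pos)
{
    auto it = std::lower_bound(d.flipped.begin(), d.flipped.end(), pos);
    if (it != d.flipped.end() && *it == pos)
        d.flipped.erase(it);
    else
        d.flipped.insert(it, pos);
}
```

Each stored difference costs 32 bits here, so this only beats the dense L-bit string while a chromosome differs from its reference at fewer than L/32 positions. Hybrids between distant groups would need either a closer reference or a different scheme.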

I could also place all the chromosomes at the tips of a B+ tree, where each branch of the tree lists how many differences it has from the reference chromosomes. The very base of the B+ tree could be [0,0,0,0,...0,0] for example. Every few time steps, I could recompute the entire B+ tree to clean things up. The existence of hybrids will be a problem with this solution, though. Also, I am wondering whether I want to allow reversal mutations in a B+ tree. Maybe I should allow some branches of the B+ tree to only consider subparts of the chromosome (like the first L/5 bits). In any case, I am unsure about the details of how all of that would be implemented (like how the B+ tree would be recomputed to clean it up).

Max Langhof · Answer 1 · 2019-12-05T18:44:55.750

My first approach would be to split each chromosome into equally sized chunks. Most chromosomes will share most of their chunks: you might have 10000 unique chromosomes of length 1000, but a given 20-bit segment will likely have only a few unique values within the population.

I would design this with cache lines in mind from the start: since cache lines tend to be 64 bytes (give or take a factor of 2), you might want to have chunks of ~512 bits and then represent each chromosome as a series of pointers to concrete chunk values. A 20-chunk (~10k bit) chromosome instance would then only need 160 bytes of pointers (on a 64-bit machine) instead of roughly 1.2k bytes of raw bits. It needs even less if you don't use pointers but indices into some data structure. You also need to store the different "variations" of each chunk, but if there are e.g. 10 variants of each chunk, that only takes space equivalent to 10 full chromosomes.
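The arithmetic above can be written down as a small calculator. This is just a back-of-the-envelope sketch: the `Layout` struct and both function names are hypothetical, and it assumes every chunk position has the same number of variants.

```cpp
#include <cstddef>

// Hypothetical description of one storage layout.
struct Layout
{
    std::size_t chromosomes;         // N
    std::size_t chunksPerChromosome; // L / bits per chunk
    std::size_t chunkBytes;          // 64 for a 512-bit chunk
    std::size_t variantsPerChunk;    // distinct values seen per chunk position
    std::size_t handleBytes;         // 8 for a pointer, less for a small index
};

// Every chromosome stores all of its bits.
constexpr std::size_t denseBytes(const Layout& l)
{
    return l.chromosomes * l.chunksPerChromosome * l.chunkBytes;
}

// Every chromosome stores one handle per chunk; chunk values are shared.
constexpr std::size_t chunkedBytes(const Layout& l)
{
    return l.chromosomes * l.chunksPerChromosome * l.handleBytes
         + l.chunksPerChromosome * l.variantsPerChunk * l.chunkBytes;
}
```

For N = 10000 chromosomes of 20 chunks with 10 variants each, `Layout{10000, 20, 64, 10, 8}` gives 12,800,000 bytes dense against 1,612,800 bytes with 8-byte pointers. Switching to 2-byte indices brings that down to 412,800 bytes.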

(Come to think of it, you might as well call a chunk a Gene; that would fit pretty well.)

This is inspired by the flyweight pattern (https://en.wikipedia.org/wiki/Flyweight_pattern). I wouldn't get too lost in the OOP diagrams though. Aside from the chromosome representation described above, you'd only need some storage for the referenced chunk values and a decent way to figure out when a mutation results in a unique chunk value that has to be added to the storage. Most mutations will create a unique value, so the real question is when to merge or drop one. You should probably reserve a small bit of space in each chunk for a reference count of sorts.

Here's a bare-bones implementation you can play with:

#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <set>
#include <vector>

// TODO: Magic numbers.

struct alignas(64) Gene
{
    mutable std::uint32_t refCount;
    std::array<std::uint32_t, 16 - 1> bitStorage;
};

static_assert(sizeof(Gene) == 64); // Check that there's no padding.


struct Chromosome
{
    std::vector<const Gene*> genes;
};


class World
{
    struct GeneValueComparer
    {
        bool operator()(const Gene& lhs, const Gene& rhs) const
        {
            return std::memcmp(lhs.bitStorage.data(), rhs.bitStorage.data(), 60) < 0;
        }
    };

    std::vector<Chromosome> chroms;
    std::set<Gene, GeneValueComparer> geneBank;

    const Gene* addGene(Gene g)
    {
        auto [it, inserted] = geneBank.insert(g);
        if (inserted)
            it->refCount = 1;
        else
            it->refCount++;
        return &*it;
    }

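    void duplicateChromosome(std::size_t index)
    {
        // Copy the pointer list first: push_back may reallocate chroms.
        Chromosome copy = chroms[index];
        for (const Gene* g : copy.genes)
            g->refCount++; // the copy shares every gene with the original
        chroms.push_back(copy);
    }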

    void deref(const Gene& g)
    {
        g.refCount--;
        if (g.refCount == 0)
            geneBank.erase(g);
    }

    Gene extractGene(const Gene& g)
    {
        Gene copy = g; // copy first: deref may erase g from the bank
        deref(g);
        return copy;
    }

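    void flipOne(std::size_t chromToFlip)
    {
        // Demo only: inverts every bit of every gene of one chromosome.
        for (const Gene*& g : chroms[chromToFlip].genes)
        {
            Gene modifiable = extractGene(*g);
            for (auto& val : modifiable.bitStorage)
                val = ~val; // bitwise NOT; logical ! would collapse the word to 0 or 1
            g = addGene(modifiable); // point the chromosome at the new value
        }
    }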
};

https://godbolt.org/z/VaTd9j

Note how obnoxiously cheap duplicating a chromosome is!

I chose a std::set to store all the genes that currently exist. I don't know whether that's the best option - it's an elegant way to guarantee uniqueness of all genes, but comparing (up to) the full gene length for each comparison is presumably slow. It would probably pay off to allow some amount of duplicated genes and then merge them occasionally. Also, since each Gene is allocated separately, any allocation overhead would end up in a different cache line, which is suboptimal. This is probably the part that can be improved the most.
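One way the "allow duplicates, merge occasionally" idea could look is sketched below. It is not the code from the link: it assumes genes live contiguously in a `std::vector` and chromosomes hold 32-bit indices instead of pointers, and the names `Pool`, `IndexedChromosome` and `compact` are made up.

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using GeneBits = std::array<std::uint32_t, 16>;       // one 64-byte chunk
using IndexedChromosome = std::vector<std::uint32_t>; // indices into Pool::genes

struct Pool
{
    std::vector<GeneBits> genes;           // contiguous; duplicates are allowed
    std::vector<IndexedChromosome> chroms;
};

// Occasional clean-up: keep one copy of every gene value that is still
// referenced, drop everything else, and remap the chromosome indices.
void compact(Pool& p)
{
    std::map<GeneBits, std::uint32_t> seen; // gene value -> new index
    std::vector<GeneBits> kept;
    for (auto& chrom : p.chroms)
        for (auto& idx : chrom)
        {
            auto it = seen.find(p.genes[idx]);
            if (it == seen.end())
            {
                it = seen.emplace(p.genes[idx],
                                  static_cast<std::uint32_t>(kept.size())).first;
                kept.push_back(p.genes[idx]);
            }
            idx = it->second;
        }
    p.genes = std::move(kept);
}
```

Between clean-ups a mutation simply appends the modified gene to `genes` and updates one index, with no lookup and no reference count. The full comparisons only happen inside `compact`, which still uses an ordered map here; a hash map might be faster.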
