
After rethinking the design, and with some input from paddy, I came up with something like this, but I wonder about its correctness; it seems fine when I run it... The idea is that preallocated objects inherit from the following:

struct Node
{
    void* pool;
};
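
(For illustration, a pooled type, e.g. the Cat used further down, would simply derive from it; the members here are just made up.)

struct Cat : Node //Node::pool will point back at the pool this Cat came from
{
    int age;
    double weight;
};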

That way we inject into every allocated object a pointer to its pool, for releasing it back later. Then we have:

template<class T, int thesize>
struct MemPool
{
    T* getNext();
    void free(T* ptr);

    struct ThreadLocalMemPool
    {
        T* getNextTL();
        void freeTL(T* ptr);

        int size;
        vector<T*> buffer;
        vector<int> freeList;
        int freeListIdx;
        int bufferIdx;
        ThreadLocalMemPool* nextTLPool; //within a thread's context, a linked list
    };

    int size;
    threadlocal ThreadLocalMemPool* tlPool; //one of these per thread ('threadlocal' is pseudocode for __thread)
};

So basically I say MemPool<Cat, 100> and it gives me a mempool which, for every thread that calls getNext() on it, will instantiate a thread-local mempool. Sizes I round internally to the nearest power of two for easy modulo (which for simplicity I'll omit here; there's a small illustrative sketch at the end of this question). Because getNext() is local to each thread it does not require locking, and I try to use atomics for the freeing part, as follows:

T* ThreadLocalMemPool::getNextTL()
{
    int iHead = ++bufferIdx % size;
    int iTail = freeListIdx % size;

    if (iHead != iTail)  // If head reaches tail, the free list is empty.
    {
        int & idx = freeList[iHead];
        while (idx == DIRTY) {}
        return buffer[idx];
    }
    else
    {
        bufferIdx--; //we will recheck next time
        if (nextTLPool)
            return nextTLPool->getNextTL();
        else
        {
            //set nextTLPool to a new ThreadLocalMemPool and return getNextTL() from it..
        }
    }
}

void ThreadLocalMemPool::freeTL(T* ptr)
{
    //the outer struct handles calling this on the right ThreadLocalMemPool

    //we compute this object's index in the pool it came from by subtracting the
    //address of the first element of this pool's buffer from the object's address
    int idx = computeAsInComment(ptr);

    int oldListIdx = atomic_increment_returns_old_value(freeListIdx);
    freeList[oldListIdx % size] = idx;
}

Now, the idea is that freeListIdx will always trail behind bufferIdx in a pool, because (assuming correct usage) you can't free more than you have allocated. Calls to free synchronize the order in which they return buffer indices to the free list, and getNext will pick up on this as it cycles back around. I have been thinking about it for a bit and don't see anything semantically wrong with the logic. Does it seem sound, or is there something subtle which could break it?
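
(As an aside, the power-of-two rounding I mentioned is just something like the sketch below; roundUpPow2 is only an illustrative helper name, not part of the actual code.)

//Illustrative only: round a requested size up to the next power of two so that
//idx % size can be done with the cheaper mask idx & (size - 1).
int roundUpPow2(int n)
{
    int p = 1;
    while (p < n) p <<= 1;
    return p;
}
//e.g. roundUpPow2(100) == 128, so idx % 128 is the same as idx & 127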


1 Answer


The thread-safety issue requires locking. If you want to relax that, you need the constraint that only one thread ever uses the pool. You can extend this to two threads if you use the circular free-list that I'll describe below, with the proviso that one thread is responsible for allocation and the other for deallocation.

As for using a vector without any other management, that is a bad idea... As soon as your allocations start getting fragmented, they take a hit.

A nice way to implement this is to just allocate a large block of T. Then make a circular queue large enough to point to each of those blocks. That is your 'free-list'. You might just choose to use indices. If you limit each pool to 65536 items, you can choose unsigned short to save space (actually, it's 65535 to allow efficient circular queue management).

By using a circular queue, you allow constant-time allocation and deallocation regardless of fragmentation. You also know when your pool is full (i.e. the free-list is empty), and you can create another pool. Obviously, when you create a pool you need to fill the free-list.

So your class would look sort of like this:

template<class T, size_t initSize>
class MemPool
{
    vector<T> poolBuffer;              // The memory pool
    vector<unsigned short> freeList;   // Ring-buffer (indices of free items)
    unsigned short nHead, nTail;       // Ring-buffer management
    int nCount;                        // Number of elements in ring-buffer
    MemPool<T,initSize> *nextPool;     // For expanding memory pool

    // etc...
};
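
(To give an idea of the initial state, here is a sketch of a constructor that fills the free-list, since a brand-new pool is completely free. It assumes only the members shown above and isn't tested.)

MemPool()
    : poolBuffer(initSize), freeList(initSize),
      nHead(0), nTail(0), nCount((int)initSize), nextPool(NULL)
{
    // Every slot starts free, so the ring-buffer initially holds every index.
    for( size_t i = 0; i < initSize; ++i )
        freeList[i] = (unsigned short)i;
}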

Now, for the locking. If you have access to atomic increment and decrement instructions and are reasonably careful, you can maintain the free-list with thread-safety. The only mutex-style locking required is when you need to allocate a new memory pool.

I've altered my original thinking. You kinda need two atomic operations and you need a reserved index value (0xffff) to spin on for non-atomic operations on the queue:

// I'll have the functions atomic_incr() and atomic_decr().  The assumption here
// is that they do the operation and return the value as it was prior to the
// increment/decrement.  I'll also assume they work correctly for both int and
// unsigned short types.
unsigned short atomic_incr( unsigned short & );
int atomic_incr( int & );
int atomic_decr( int & );
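
(These are deliberately left abstract. For concreteness, one possible mapping, and certainly not the only one, is GCC's __sync builtins, which also return the pre-operation value; with C++11 you could instead make the counters std::atomic and use fetch_add/fetch_sub.)

// One possible implementation, assuming GCC-style __sync builtins are available.
unsigned short atomic_incr( unsigned short &v ) { return __sync_fetch_and_add( &v, (unsigned short)1 ); }
int            atomic_incr( int &v )            { return __sync_fetch_and_add( &v, 1 ); }
int            atomic_decr( int &v )            { return __sync_fetch_and_sub( &v, 1 ); }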

So the allocation goes something like:

T* alloc()
{
    // Check the queue size.  If it's zero (or less) we need to pass on
    // to the next pool and/or allocate a new one.
    if( nCount <= 0 ) {
        return alloc_extend();
    }

    int count = atomic_decr(nCount);
    if( count <= 0 ) {
        T *mem = alloc_extend();
        atomic_incr(nCount);     // undo
        return mem;
    }

    // We are guaranteed that there is at least 1 element in the list for us.
    // This will overflow naturally to achieve modulo by 65536.  You can only
    // deal with queue sizes that are a power of 2.  If you want 32768 values,
    // for example, you must do this: head &= 0x7fff;
    unsigned short head = atomic_incr(nHead);

    // Spin until the element is valid (use a reference)
    unsigned short & idx = freeList[head];
    while( idx == 0xffff );

    // Grab the pool item, and blitz the index from the queue
    T * mem = &poolBuffer[idx];
    idx = 0xffff;

    return mem;
}

The above uses a new private member function:

T * alloc_extend()
{
    if( nextPool == NULL ) {
        acquire_mutex_here();
        if( nextPool == NULL ) nextPool = new MemPool<T,initSize>;
        release_mutex_here();
        if( nextPool == NULL ) return NULL;
    }
    return nextPool->alloc();
}
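
(acquire_mutex_here / release_mutex_here stand in for whatever locking primitive you have. For example, assuming a std::mutex member called poolGrowMutex, which is not part of the class sketch above, the guarded section could be written as:)

// Sketch only: assumes '#include <mutex>' and a 'std::mutex poolGrowMutex;' member.
if( nextPool == NULL ) {
    std::lock_guard<std::mutex> lock(poolGrowMutex);  // released on scope exit
    if( nextPool == NULL ) nextPool = new MemPool<T,initSize>;
}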

When you want to free:

void free(T* mem)
{
    // Find the right pool to free from.
    if( mem < &poolBuffer.front() || mem > &poolBuffer.back() )
    {
        if( nextPool ) nextPool->free(mem);
        return;
    }

    // You might want to maintain a bitset that indicates whether the memory has
    // actually been allocated so you don't corrupt your pool here, but I won't
    // do that in this example...

    // Work out the index.  Hope my pointer arithmetic is correct here.
    unsigned short idx = (unsigned short)(mem - &poolBuffer.front());

    // Push index back onto the queue.  As with alloc(), you might want to
    // use a mask on the tail to achieve modulo.
    int tail = atomic_incr(nTail);
    freeList[tail] = idx;

    // Don't need to check the list size.  We assume everything is sane. =)
    atomic_incr(nCount);
}

Notice I used the value 0xffff, effectively as a NULL. The setting, clearing and spinning on this value are there to prevent a race situation. You cannot guarantee that it is safe to leave old data in the queue when multiple threads may be calling free while you have other threads calling alloc. Your queue will be cycling through, but the data in it might not be set yet.

Instead of indices, of course, you could just use pointers. But that is 4 bytes (or 8 bytes on a 64-bit application) and the memory overhead might not be worth it, depending on the data size you are pooling. Personally, I would use pointers, but for some reason it seemed easier to use indices in this answer.
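
(To round this off, usage would look roughly like the sketch below. Widget and the worker function are hypothetical; the pool hands out raw slots, so you construct into them, e.g. with placement new, before use.)

// Hypothetical usage of the pool sketched above (assumes '#include <new>' and
// some pooled type called Widget).
MemPool<Widget, 4096> widgetPool;    // shared by all threads

void worker()
{
    Widget *w = widgetPool.alloc();  // lock-free unless a new pool must be created
    if( w != NULL ) {
        new (w) Widget();            // construct in the raw slot
        // ... use the widget ...
        w->~Widget();                // destroy before handing the slot back
        widgetPool.free(w);          // pushes its index back onto the ring-buffer
    }
}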

  • My apologies if I've missed the mark. I sort of ignored the 'threadlocal' thing. Actually, I haven't encountered that keyword before =/ Hopefully my answer is still useful. – paddy Sep 17 '12 at 22:36
  • So the circular buffer just has indices of freed elements which I can get, I see... and if that fails I would then grow the poolBuffer and presumably resize the circular buffer... I do have increment and decrement atomics; what is the general mechanism for using them here? Oh, and threadlocal I wrote for pseudocode clarity; in code I actually wrote __thread – Palace Chan Sep 17 '12 at 23:16
  • No you wouldn't grow the circular buffer. You would just allocate a new MemPool and pass allocation requests through to that. On further thought, the circular buffer actually isn't as easy as just having atomic increment on a head and tail pointer. First, you can't do modulo unless you force the pool size to a power of two. Second, you need to test head and tail at the same time as the atomic increment. If you can't do that, you also need to maintain a list size (as a signed int) and do atomic incr/decr on that. If it falls below zero... Well, I'll edit my answer a bit... – paddy Sep 18 '12 at 00:35
  • Done... err... as I said, you have to be pretty careful. I might not have got it right. Hope that helps anyway. – paddy Sep 18 '12 at 01:25
  • I like the general idea; with thread-local it can (I think) simplify the allocation part (each thread would have its own pool, but deallocation can happen from any thread). I'll edit my answer later when I transfer what I'm thinking, mixed with some suggestions of yours, to code! – Palace Chan Sep 18 '12 at 12:39
  • Looks okay to me. I'm new to this thread-local syntax, but it seems to make things a bit easier. I suppose the downside to it is if you want a huge default pool size and only one thread is doing the allocation, you'll have a lot of big empty pools in the other threads. You have made a couple of errors. I will edit your question to fix. – paddy Sep 19 '12 at 03:33
  • I have edited. See if that makes sense. I figured that `bufferIdx` is your head and `freeListIdx` is your tail. Also assumed that when they are equal the list is empty. You cannot assume the list is empty just because a free index is DIRTY (that just means it's currently being freed by another thread). So I added the DIRTY spin-loop. If you would prefer to just chain to the next pool (or create a new one) in that case, then put that test back in and take out the loop... But make sure you keep the head/tail comparison because that will indicate when your pool is exhausted. – paddy Sep 19 '12 at 03:47
  • The thread-local mempools are only instantiated upon a thread's first getNext request, so only a thread which allocates will have memory pools. I'm a bit confused by the spin addition. My train of thought was like so: you allocate whatever index is in freeList by traversing it with bufferIdx++; the threads which free do so by writing to freeList as well, with an atomic freeListIdx++ (atomic because many threads can free into the same mempool, but only one can allocate from it). – Palace Chan Sep 19 '12 at 12:14
  • At any point at which a freeing thread writes into freeList, the invariant freeListIdx < bufferIdx holds (modulo). This is because you can only free that which has been allocated (and so bufferIdx will always be at least one ahead of you in the allocation check). It is true that if you loop back while a freeing thread is in the middle of freeing a dirty position, you could read DIRTY just before the freeing thread writes a valid idx there... and you allocate a new pool... but this is OK; next time around you will catch it. It seems like an innocuous race to me, at least... or is there a worse scenario? – Palace Chan Sep 19 '12 at 12:19
  • The deal is that if you have one or more threads freeing memory in another thread's pool and that pool is mostly used, it's conceivable that while you are trying to allocate memory it will appear that there are items in the free list but the indices have not yet been restored. In this case, the choice is to spin on the value (it would be a very small amount of spinning) or, like you say, just use the next pool and know that next time through the index will probably be restored. You're probably justified in choosing the latter option. The more your pool grows, the less often it will happen. – paddy Sep 19 '12 at 23:59
  • Don't you need to divide by sizeof(T) in this line: unsigned short idx = (unsigned short)(mem - &poolBuffer.front()); ? – Mr. Developerdude Mar 01 '14 at 21:02
  • @LennartRolland No, this is computing an index into a vector whose element datatype is `T`. The pointer arithmetic is correct. Maybe you need to brush up by writing yourself a test program that outputs the result of pointer arithmetic on different datatypes. – paddy Mar 02 '14 at 23:24