6

Anyone thought about how to write a memory manager (in C++) that is completely branch free? I've written a pool, a stack, a queue, and a linked list (allocating from the pool), but I am wondering how plausible it is to write a branch free general memory manager.

This is all to help make a really reusable framework for doing solid concurrent, in-order CPU, and cache friendly development.

Edit: by branchless I mean without doing direct or indirect function calls, and without using ifs. I've been thinking that I can probably implement something that first changes the requested size to zero for false calls, but haven't really got much more than that. I feel that it's not impossible, but the other aspect of this exercise is then profiling it on said "unfriendly" processors to see if it's worth trying as hard as this to avoid branching.
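The "change the requested size to zero" idea can be sketched with a mask instead of an `if`. This is only an illustration: `masked_size` and its `wanted` parameter are hypothetical names, and whether the compiler really emits branch-free machine code for it still has to be checked on the target.

```cpp
#include <cstddef>

// Hypothetical helper: returns `size` when `wanted` is true and 0 otherwise,
// using a bit mask rather than an `if`.
inline std::size_t masked_size(std::size_t size, bool wanted) {
    // `wanted` converts to 0 or 1. Subtracting it from 0 in unsigned
    // arithmetic yields all-zero bits or all-one bits.
    const std::size_t mask =
        static_cast<std::size_t>(0) - static_cast<std::size_t>(wanted);
    return size & mask;
}
```

A caller would pass the result straight to the allocator, so a "false" call degenerates into a zero-byte request.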

skaffman
Richard Fabian
  • 2
    What do you mean by a "branch"? –  Mar 22 '10 at 10:23
  • @Neil, I suppose, it's something that splits control flow (`if` operator, for example). – P Shved Mar 22 '10 at 10:30
  • If branch means `if`, then the answer is just no. @OP: could you please clarify if that is indeed what you mean? – Björn Pollex Mar 22 '10 at 11:10
  • 2
    This would be an out-of-bounds answer, so I am adding it only as a comment. You may want to consider having an implementation with as few branches as possible and adding compiler/platform dependent code to help the branch predictor. – David Rodríguez - dribeas Mar 22 '10 at 12:55
  • Branchless C++ code does not necessarily translate into branchless assembly. Also, branching C++ code does not necessarily translate into branching assembly. So if the goal is to be extremely friendly to CPUs that can barely handle branches, then branchless C++ is probably not the right tool. – MSalters Mar 22 '10 at 13:03
  • It would be enough to have all branches be predicted correctly almost all of the time. That should be an easier problem to solve. – usr May 01 '15 at 09:13
  • I've always understood "branchless" to mean, code given in a way that always writes all the outputs and (usually) merges masked bits and computes mask from boolean computations. If a run of code is one basic block, how could it branch? In my experience, compilers don't avoid branches (very much), even with optimizations like `-fif-conversion` and `-fif-conversion2`. Comparator callbacks for binary searches or ordered maps have random branch behavior. Branch mispredict is expensive. This would only help if there were multiple comparisons per `<` call. – doug65536 May 08 '16 at 23:44

2 Answers

2

While I don't think this is a good idea, one solution would be to have pre-allocated buckets of blocks in various power-of-two sizes, indexed by log2 of the block size. Crude pseudocode:

class Allocator {
public:
    void* malloc(size_t size) {
        //Pick the power-of-two bucket, leaving room for the header
        int bucket = log2(size + sizeof(int));
        int* pointer = reinterpret_cast<int*>(m_buckets[bucket].back());
        m_buckets[bucket].pop_back();
        *pointer = bucket; //Store which bucket this was allocated from
        return pointer + 1; //Don't overwrite header
    }

    void free(void* pointer) {
        //Step back to the header to find the owning bucket
        int* temp = reinterpret_cast<int*>(pointer) - 1;
        m_buckets[*temp].push_back(temp);
    }

private:
    //m_buckets[i] holds pre-allocated free blocks of (1 << i) bytes
    std::vector< std::vector<void*> > m_buckets;
};

(You would of course also replace the std::vector with a simple array + counter).
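A minimal sketch of what the array + counter replacement could look like. `Bucket` and `Capacity` are hypothetical names, not part of the pseudocode above, and the lack of bounds checks is deliberate: that is exactly what keeps it branch-free.

```cpp
#include <cstddef>

// Hypothetical fixed-capacity stack of free block pointers.
template <std::size_t Capacity>
struct Bucket {
    void* slots[Capacity];
    std::size_t count = 0;

    // No bounds check: pushing onto a full bucket is undefined behavior.
    void push_back(void* p) { slots[count++] = p; }

    // No emptiness check: popping from an empty bucket is undefined behavior.
    void* pop_back() { return slots[--count]; }
};
```

Popping from an empty bucket or pushing onto a full one is undefined behavior here, so every bucket has to be filled up front with enough blocks for the worst case.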

EDIT: In order to make this robust (i.e. handle the situation where the bucket is empty) you would have to add some form of branching.

EDIT2: Here's a small branchless log2 function:

//Returns the smallest x such that value <= (1 << x), for 2 <= value < (1 << 25)
//Assumes a 32-bit int and IEEE 754 single precision floats
int
log2(int value) {
    union Foo {
        int x;
        float y;
    } foo;
    foo.y = value - 1;
    return ((foo.x & (0xFF << 23)) >> 23) - 126; //Extract exponent (base 2) of floating point number
}

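This gives the correct result for value < 33554432 (2^25), i.e. for allocations just under 32 MiB including the header. Beyond that, value - 1 no longer fits in a float's 24-bit significand and can round up to the next power of two. If you need larger allocations you'll have to switch to doubles.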
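For anyone who wants to check the trick, here is a sketch of the same exponent extraction using `std::memcpy` for the type pun instead of a union, next to a simple loop to compare against. The names `ceil_log2_float` and `ceil_log2_reference` are made up for this example, and it assumes IEEE 754 single precision floats.

```cpp
#include <cstdint>
#include <cstring>

// Same exponent trick as above, with std::memcpy doing the type pun.
// Assumes IEEE 754 single precision; valid for 2 <= value < (1 << 25).
inline int ceil_log2_float(std::int32_t value) {
    const float f = static_cast<float>(value - 1);
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    // Bits 23..30 hold the biased exponent (bias 127).
    return static_cast<int>((bits >> 23) & 0xFF) - 126;
}

// Loop-based reference (it branches), used only to check the trick.
inline int ceil_log2_reference(std::int32_t value) {
    int x = 0;
    while ((std::int64_t(1) << x) < value) ++x;
    return x;
}
```

The two agree across the stated range. For value = 1 the float version returns -126, which is harmless in the allocator because the header keeps the argument at `sizeof(int)` or more.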

Here's a link to how floating point numbers are represented in memory.

Andreas Brinck
  • 1
    Log2 will probably need a platform-dependent implementation to be branchless. On x86 you'll probably need something doing a BSR instruction on the arguments. – Jasper Bekkers Mar 22 '10 at 11:33
  • @Jasper: there's some code here that claims to be a branchless clz - I assume without testing that it works: http://stackoverflow.com/questions/2255177/finding-the-exponent-of-n-2x-using-bitwise-operations-logarithm-in-base-2-of/2255282#2255282. From a brief skim, it seems to return 0 for input 0, so you may want a branch to cover either the 0 case, or the greater-than-half-the-range case. As you say, though, implementations may provide access to faster CPU ops. – Steve Jessop Mar 22 '10 at 11:44
  • I guess. Going through float might be the best log2 on some platforms, but it's only pseudo-portable, so it can't be the most general fallback. – Steve Jessop Mar 22 '10 at 13:01
  • "If you need larger allocations you'll have to switch to doubles". Assuming you'd be allocating larger blocks out of power-of-two buckets in the first place. Hopefully "branchless" means "except for uncommon cases". – Steve Jessop Mar 22 '10 at 13:05
  • The `log2` method blew my mind, I am not sure I'll ever get those brain cells back... – Matthieu M. Mar 22 '10 at 16:08
  • @Matthieu It's just taking the exponent part of the floating point value by interpreting it as an int. – Jasper Bekkers Mar 23 '10 at 00:18
  • Well, I am not that well versed in the physical representation of the `float`... even if I kind of admire the trick, is it portable ? Anyhow I think it would at least benefit from a comment to indicate what's going on, I sure didn't realize we were extracting the exponent! – Matthieu M. Mar 23 '10 at 07:20
  • @Matthieu I think it's dependent on endianness, but this is easily handled. Apart from this if your CPU supports IEEE 754 floats, it should work fine. Added a link to the relevant wikipedia article on floats for those interested in the details of how floats are represented in memory. – Andreas Brinck Mar 23 '10 at 08:33
  • This is a lovely example of an allocator (from an academic standpoint, a log2 allocator might not be so handy in practice), but I must throw a wrench in the middle of this and ask, how does one get a branchless `push_back`, or avoid any form of branching whatsoever in the array + counter alternative? Any time I think variable-sized container, I think branching has to be involved unless it's a linked list, but then we pay for the allocation overhead per chunk. :-( –  May 01 '15 at 08:42
0

The only way I know to create a truly branchless allocator is to reserve all the memory it will potentially use in advance. Otherwise there will always be some hidden code somewhere that checks whether we are exceeding the current capacity, for example inside the push_back of a vector used to implement it, which checks whether the size exceeds the capacity.

Here is one such crude example of a fixed-size allocator which has completely branchless malloc and free methods.

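#include <algorithm> // std::max
#include <cassert>   // assert

class FixedAlloc
{
public:
    FixedAlloc(int element_size, int num_reserve)
    {
        // Each slot must be big enough to hold the free-list link.
        element_size = std::max(element_size, static_cast<int>(sizeof(Chunk)));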
        mem = new char[num_reserve * element_size];

        char* ptr = mem;
        free_chunk = reinterpret_cast<Chunk*>(ptr);
        free_chunk->next = 0;

        Chunk* last_chunk = free_chunk;
        for (int j=1; j < num_reserve; ++j)
        {
            ptr += element_size;
            Chunk* chunk = reinterpret_cast<Chunk*>(ptr);
            chunk->next = 0;
            last_chunk->next = chunk;
            last_chunk = chunk;
        }
    }

    ~FixedAlloc()
    {
        delete[] mem;
    }

    void* malloc()
    {
        // Debug-only check; compiled out when NDEBUG is defined.
        assert(free_chunk && "Reserve memory exhausted!");
        Chunk* chunk = free_chunk;
        free_chunk = free_chunk->next;
        return chunk->mem;
    }

    void free(void* mem)
    {
        Chunk* chunk = static_cast<Chunk*>(mem);
        chunk->next = free_chunk;
        free_chunk = chunk;
    }

private:
    union Chunk
    {
        Chunk* next;
        char mem[1];
    };
    char* mem;
    Chunk* free_chunk;
};

Since it's totally branchless (once the assert is compiled out in a release build), it simply segfaults if you try to allocate more memory than was initially reserved. Freeing a null pointer is also undefined behavior. I also avoided dealing with alignment for the sake of a simpler example.
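If alignment does matter, one possible approach is to round the element size up to a multiple of the strictest fundamental alignment, which can itself be done without branching. This is a sketch only: `round_up` and `aligned_slot_size` are hypothetical helpers, and it assumes the buffer returned by `new char[]` is itself suitably aligned.

```cpp
#include <cstddef>

// Hypothetical helper: rounds `size` up to the next multiple of `alignment`
// without branching. `alignment` must be a power of two.
constexpr std::size_t round_up(std::size_t size, std::size_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}

// Slot size that is a multiple of the strictest fundamental alignment,
// so consecutive slots stay aligned if the first one is.
constexpr std::size_t aligned_slot_size(std::size_t element_size) {
    return round_up(element_size, alignof(std::max_align_t));
}
```

The constructor would then use `aligned_slot_size(element_size)` as the stride between chunks instead of the raw element size.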