Branchless Overflow Handling

Question

I'm trying to create a type of safe buffer that automatically handles overflow without any branching. The buffer size is a power of two and shall only have valid positive (i.e. not including zero) indices. It also allows checked removal, which is removal at a given index if the element stored at that index is equal to a search key.

I was essentially going for something like this:

Element *buffer[256];

inline void buffer_insert(size_t index, Element *elem){
  buffer[index < 256 && index] = elem;
}

//Optional: checked insert to prevent overwrite. Will only insert
//if the buffer holds NULL at index.
inline void buffer_checkedInsert(size_t index, Element * elem){
  buffer[index && !buffer[index < 256 && index]] = elem;  
}

inline void buffer_checkedRemove(size_t index, Element *elem){
  buffer[0] = NULL; //Maybe useful if buffer[0] stores elem
  buffer[((elem == buffer[index < 256 && index])) && index] = NULL;
}

So I basically want to access index 0 whenever the index passed in is out of bounds, as buffer[0] is not a valid buffer index. And I also want to access index 0 whenever the element to be removed is not equal to the element that is passed into the removal, and I might want to also access index 0 if the buffer contains something at index.

My questions are:

Is what I have really branchless? Because if the C compiler decides to use short-circuiting on &&, the code might get branched.
If && causes branching, is there an alternative that has the same behavior in this case that does not involve branching?
Can this be faster than a basic overflow check? Or could the C compiler somehow give a branchless version of if(index < 256) buffer[index] = elem?

`&&` is short-circuiting by design. Its use generally emits a branch. Using the result of a comparison operator as a value might also cause a branch to be emitted, depending on architecture (on x86 it doesn't). — fuz, Oct 04 '16 at 20:26
As a conceptual question: Think if it is really better to have out-of-bounds reads and writes silently do nothing rather than crashing as they should. Also think if branchless code is really worth the additional length. A jump taken almost never is very cheap and I don't think you are going to trigger the overflow check more than occasionally. — fuz, Oct 04 '16 at 20:30
`&&` doesn't do what you think it does. e.g. the result of `&&` can only ever be `0` or `1`. — , Oct 04 '16 at 20:37
`buffer[index < 256 && index]` will only ever access `buffer[0]` and `buffer[1]`. Certainly this is not OP's goal. So it appears code is posted that was not first tested for functionality before considering questions like "really branchless?" and "Can this be faster". Suggest making test code that not only tests functionality, but can be used to assess performance. Functionality first, performance second. — chux - Reinstate Monica, Oct 04 '16 at 21:37
Good point on the functionality. As for using (index < 256) * index, could it be faster (and branchless) to do something like (one_zero[(index < 256)] & index), where one_zero is a global array containing {0xffffffffffffffff, 0x0}? Or would the loading be too expensive? — Navneeth, Oct 04 '16 at 22:17
@Navneeth: The answer would depend entirely on the architecture, whether the global array was `const`, the compiler implementation, how hot the code is (determines whether array is kept in cache line), etc. There is no way to know aside from actually testing it. Odds are, the multiply won't be any slower; integer multiply is fast on modern chips, and as long as you're not performing other multiply operations in close proximity, the multiply will finish in (amortized) 1-2 cycles, which is identical to bitwise-and (which can often be dispatched w/more parallelism, but the load might cost you). — ShadowRanger, Oct 04 '16 at 22:33
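For reference, the three index-clamping variants discussed in these comments could be sketched as follows. This is only an illustration: the names `mask_table`, `clamp_table`, `clamp_negate`, `clamp_multiply`, and `clamp_self_check` are made up here. Note that for a lookup of the form `mask_table[index < 256]` to keep in-range indices, entry 0 has to be the all-zero mask and entry 1 the all-ones mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup table: entry 0 (out of range) clears the index,
 * entry 1 (in range) keeps every bit of it. */
static const size_t mask_table[2] = { 0, SIZE_MAX };

/* Variant from the comment: load a mask from a table, then AND. */
static inline size_t clamp_table(size_t index) {
    return mask_table[index < 256] & index;
}

/* Same mask computed arithmetically: 0 - 1 wraps to all ones for size_t. */
static inline size_t clamp_negate(size_t index) {
    return (0 - (size_t)(index < 256)) & index;
}

/* Multiplication by the 0/1 comparison result. */
static inline size_t clamp_multiply(size_t index) {
    return (index < 256) * index;
}

/* Quick consistency check: all three variants agree on a range of inputs. */
static inline void clamp_self_check(void) {
    for (size_t i = 0; i < 1024; ++i) {
        assert(clamp_table(i) == clamp_multiply(i));
        assert(clamp_negate(i) == clamp_multiply(i));
    }
}
```

Which of these is fastest, if any, still has to be measured on the target compiler and hardware.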

score 2 · Accepted Answer · answered Oct 04 '16 at 22:23

Is what I have really branchless? Because if the C compiler decides to use short-circuiting on &&, the code might get branched.

Maybe. The compiler might be clever enough to emit branchless machine code in these cases, but you cannot rely on it.

If && causes branching, is there an alternative that has the same behavior in this case that does not involve branching?

Your question is a bit confused. The fact that a compiler may emit branching code to implement the && operation follows from the defined behavior of that operation. Any alternative that had the same behavior must afford the same possibility of branching.

On the other hand, if you mean to ask whether there is an alternative that computes the same result in all cases, then yes, you can rewrite those expressions to do so without the possibility of branching. For instance, you could use either the & or the * operator like so:

buffer[(index < 256) & (index != 0)] = elem;

Or, you could implement the behavior you actually want:

buffer[(index < 256) * index] = elem;

There's no reason to think that the compiler would emit a branch instruction for either of those computations; if it did, that would probably be because it thinks that would provide a performance improvement on the target architecture.
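Putting the multiplication idea together, the three functions from the question might look roughly like the sketch below. It assumes that slot 0 is a reserved "sink" that may be overwritten freely; the `Element` definition, the `BUFFER_SIZE` macro, and the `safe_index` helper are hypothetical and not part of the question.

```c
#include <assert.h>
#include <stddef.h>

#define BUFFER_SIZE 256u   /* assumed size; slot 0 is the reserved sink */

static_assert((BUFFER_SIZE & (BUFFER_SIZE - 1)) == 0,
              "BUFFER_SIZE must be a power of two");

typedef struct Element { int value; } Element;   /* hypothetical payload */

static Element *buffer[BUFFER_SIZE];

/* Maps every out-of-range index to 0; valid indices pass through unchanged. */
static inline size_t safe_index(size_t index) {
    return (index < BUFFER_SIZE) * index;
}

/* Unconditional insert; out-of-range writes land in slot 0. */
static inline void buffer_insert(size_t index, Element *elem) {
    buffer[safe_index(index)] = elem;
}

/* Inserts only if the slot is empty; otherwise the write lands in slot 0. */
static inline void buffer_checkedInsert(size_t index, Element *elem) {
    size_t i = safe_index(index);
    buffer[(buffer[i] == NULL) * i] = elem;
}

/* Clears the slot only if it currently holds elem; otherwise clears slot 0. */
static inline void buffer_checkedRemove(size_t index, Element *elem) {
    size_t i = safe_index(index);
    buffer[(buffer[i] == elem) * i] = NULL;
}
```

Whether a given compiler really emits branch-free machine code for this is best confirmed by inspecting the generated assembly.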

Can this be faster than a basic overflow check? Or could the C compiler somehow give a branchless version of if(index < 256) buffer[index] = elem?

The branchless versions certainly can be faster. They are most likely to be observably faster on workloads where the (non-)branch is executed a lot, and there is no easily-discernible pattern to which alternative is taken. But if the (non-)branching mostly follows a regular pattern, and especially if it almost always goes one way, then the CPU's branch prediction unit could make an ordinary validity check at least as fast as the branchless assignments.

Ultimately, there's no good reason to worry about this without benchmarking the actual performance of your code on real data, or a good facsimile thereof. The result is likely to be data dependent, and whether it matters at all depends on how much of the program's run time is spent in the functions you ask about. Until and unless you have a good benchmark demanding otherwise, you should code for clarity and maintainability.
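A rough starting point for such a benchmark is sketched below. Everything in it is an assumption made for illustration: the `Element` type, the helper names, the small generator used to produce indices, and the choice of `clock()` as the timer. Real measurements should use index sequences that resemble the actual workload, many more iterations, and repeated runs.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef struct Element { int value; } Element;   /* hypothetical payload */

static Element *buffer[256];

/* Conventional bounds check, as in the question. */
static inline void store_branchy(size_t index, Element *elem) {
    if (index < 256) buffer[index] = elem;
}

/* Branchless candidate: out-of-range writes are redirected to slot 0. */
static inline void store_branchless(size_t index, Element *elem) {
    buffer[(index < 256) * index] = elem;
}

/* Fills indices with pseudo-random values in [0, 512), so roughly half are
 * out of range and hard to predict. Stand-in for real workload data. */
static inline void fill_indices(size_t *indices, size_t n, unsigned seed) {
    assert(indices != NULL);
    unsigned state = seed;
    for (size_t i = 0; i < n; ++i) {
        state = state * 1664525u + 1013904223u;   /* simple LCG step */
        indices[i] = (state >> 16) % 512;
    }
}

/* Returns CPU seconds spent storing elem at every index, branchy version. */
static inline double time_branchy(const size_t *indices, size_t n, Element *elem) {
    clock_t start = clock();
    for (size_t i = 0; i < n; ++i) store_branchy(indices[i], elem);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Same measurement for the branchless version. */
static inline double time_branchless(const size_t *indices, size_t n, Element *elem) {
    clock_t start = clock();
    for (size_t i = 0; i < n; ++i) store_branchless(indices[i], elem);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Changing the modulus in `fill_indices` changes how often the check fails, which is one way to see how sensitive the comparison is to the data.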

Thanks, that about covers it. I'm using this for a cache in a memory allocator implementation. The branchy cache implementation seems to make the allocator slower in every test case, so I'm trying to see if this makes an improvement. — Navneeth, Oct 04 '16 at 22:35
