There's a widely known pattern for rounding numbers up to the nearest multiple of a power of two. Increment the number by one less than the power of two, and then wipe out any bits below it:
power = 1 << i
(n + (power - 1)) & ~(power - 1)
The problem with this pattern for my use case is that 0 isn't rounded up (it stays 0). The obvious solution is to add a branch, but I would prefer to avoid one because the performance of this code is extremely important.
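To make the behaviour concrete, here is a minimal sketch of the pattern as a C function. The name round_up_pow2 and the choice of uint64_t are my own assumptions for illustration, and it assumes n + (power - 1) does not overflow:

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the nearest multiple of 2^i, using the pattern above.
   Hypothetical helper: the name and the 64-bit type are assumptions.
   Preconditions: i < 64, and n + (power - 1) must not overflow. */
static uint64_t round_up_pow2(uint64_t n, unsigned i)
{
    assert(i < 64);                      /* a shift by >= 64 would be undefined */
    uint64_t power = (uint64_t)1 << i;
    /* Add one less than the power, then clear every bit below it. */
    return (n + (power - 1)) & ~(power - 1);
}
```

With i = 4 (multiples of 16), 1 and 16 both give 16 and 17 gives 32, but 0 gives 0: it is already a multiple of 16, so the mask leaves it alone.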
I've avoided this cost in some cases with a context-specific hack. Changing an earlier (x <= FAST_PATH_LIMIT) condition to (x - 1 <= FAST_PATH_LIMIT - 1) forces zero to wrap around (x is unsigned), which lets me handle it in the slow path. Sadly, the opportunity to do this isn't always available.
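A rough sketch of that check, for anyone unfamiliar with the trick. The value of FAST_PATH_LIMIT and the function names here are placeholders I made up for illustration, and it only works when x is an unsigned type and the limit is at least 1:

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder value, not the real limit; it must be >= 1 for the trick. */
#define FAST_PATH_LIMIT ((size_t)256)

/* Original check: zero passes and takes the fast path. */
static bool fast_path_plain(size_t x)
{
    return x <= FAST_PATH_LIMIT;
}

/* Rewritten check: for unsigned x, 0 - 1 wraps to SIZE_MAX, which is
   greater than FAST_PATH_LIMIT - 1, so zero fails the test and goes to
   the slow path. Every x in 1..FAST_PATH_LIMIT still passes. */
static bool fast_path_excluding_zero(size_t x)
{
    return x - 1 <= FAST_PATH_LIMIT - 1;
}
```

It is still a single comparison, and FAST_PATH_LIMIT - 1 folds to a constant, so the only added cost is the decrement of x.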
I'll happily accept a platform-specific assembly hack for a relatively obscure architecture. I just want the pleasure of knowing that there's a better way to do this. A magical trick in C or x86/ARM assembly would actually be useful though.