quickly rounding numbers >= 0 up to the multiple of a specific power of 2

Question

There's a widely known pattern for rounding numbers up to the nearest multiple of a power of two. Increment the number by one less than the power of two, and then wipe out any bits below it:

power = 1 << i
(n + (power - 1)) & ~(power - 1)

The problem with this pattern for my use case is that 0 isn't rounded up. The obvious solution is to add a branch, but I would prefer to avoid one because the performance of this code is extremely important.
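
To make the zero case concrete, here is a minimal C sketch of the pattern above. The helper name `round_up_pow2_multiple` and the choice of `size_t` are assumptions made for illustration; they are not part of the question.

```c
#include <stddef.h>

/* Round n up to a multiple of (1 << i).
   Hypothetical helper; assumes i is smaller than the bit width of
   size_t and that n + (power - 1) does not overflow. */
static size_t round_up_pow2_multiple(size_t n, unsigned i)
{
    size_t power = (size_t)1 << i;
    /* Add one less than the power, then clear the bits below it. */
    return (n + (power - 1)) & ~(power - 1);
}
```

With `i = 4`, the inputs 1 and 16 both map to 16 and 257 maps to 272, but 0 maps to 0, which is the case the question wants to change.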

I've avoided this cost in some cases with a context-specific hack. Changing an earlier (x <= FAST_PATH_LIMIT) condition to (x - 1 <= FAST_PATH_LIMIT - 1) forces zero to wrap, and allows handling it in the slow path. Sadly, the opportunity to do this isn't always available.
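
A small sketch of that wrap trick, assuming `x` is an unsigned type. The value of `FAST_PATH_LIMIT` and the function name `takes_fast_path` are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define FAST_PATH_LIMIT 256u  /* hypothetical limit, for illustration only */

/* True only for 1..FAST_PATH_LIMIT. For x == 0 the unsigned
   subtraction x - 1 wraps to SIZE_MAX, so the comparison fails and
   zero falls through to the slow path with no extra branch. */
static bool takes_fast_path(size_t x)
{
    return x - 1 <= (size_t)FAST_PATH_LIMIT - 1;
}
```

The single comparison replaces the pair `x != 0 && x <= FAST_PATH_LIMIT`, which is why it only helps where such a range check already exists.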

I'll happily accept a platform-specific assembly hack for a relatively obscure architecture. I just want the pleasure of knowing that there's a better way to do this. A magical trick in C or x86/ARM assembly would actually be useful though.

To avoid all the computation time, create a static table whose values are the results of rounding up for each value of 'i', then access the table with something like n = table[i]. – user3629249, Feb 16 '15 at 03:33
depending on the system, the ~12 operations needed to calculate the next power of 2 could be faster than waiting for a cache miss — Benni, Feb 16 '15 at 03:50
A table would work, but it would be quite large. It could be compressed, but that's going to be paying the same costs along with the cost of accessing the table simply to avoid a branch. I expect it would be cheaper, but it's not ideal. — strcat, Feb 16 '15 at 03:55
@Benni: I need rounding to a specific multiple of a power of 2 rather than the next power of 2, so it's a lot cheaper than 12 operations - at least without this annoying zero case that I need to handle. — strcat, Feb 16 '15 at 03:56
@DougCurrie: Sure, which is why the `(n + (power - 1)) & ~(power - 1)` pattern works in general. I need zero rounded up though, and I'd rather not pay for a branch because I've determined that it's a significant cost in this hot fast path. — strcat, Feb 16 '15 at 03:57

score 2 · Answer 1 · answered Feb 16 '15 at 03:28

ARM has a CLZ (Count Leading Zeros) instruction that lets you do this without a loop. Intel has a roughly equivalent BSF (Bit Scan Forward). Either lets you quickly prepare a mask.

http://en.wikipedia.org/wiki/Find_first_set
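
As a rough illustration of what a count-leading-zeros instruction buys you, the sketch below uses the GCC/Clang builtin `__builtin_clz`. The function name `next_pow2` is hypothetical, and the code computes the next power of two, not a multiple of a fixed power, so treat it as a demonstration of the instruction and not as a solution to the question.

```c
#include <limits.h>

/* Smallest power of two >= n, valid while the result fits in unsigned
   (n <= 2^31 for a 32-bit unsigned). __builtin_clz is a GCC/Clang
   builtin whose result is undefined for 0, so n <= 1 is handled
   separately. */
static unsigned next_pow2(unsigned n)
{
    if (n <= 1)
        return 1;
    unsigned bits = sizeof(unsigned) * CHAR_BIT;
    /* The highest set bit of n - 1 decides how far to shift. */
    return 1u << (bits - (unsigned)__builtin_clz(n - 1));
}
```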

answered Feb 16 '15 at 03:28

Seva Alekseyev


Doug Currie · Accepted Answer · 2015-02-16T13:35:39.367

2

If you want zero and other values that are already multiples of the power of two to always round up, then:

((n | 1) + (power - 1)) & ~(power - 1)

Or, if only zero should be rounded up:

((n | (!n)) + (power - 1)) & ~(power - 1)

Many architectures, such as PPC, can compute (!n) without a branch.
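
The two expressions can be compared side by side in a short C sketch. The function names and the use of `size_t` are assumptions for illustration, and `power` is assumed to be a power of two that is at least 2.

```c
#include <stddef.h>

/* (n | 1) variant: values that are already multiples move up to the
   next multiple, so for power == 32 both 0 -> 32 and 32 -> 64. */
static size_t round_up_always(size_t n, size_t power)
{
    return ((n | 1) + (power - 1)) & ~(power - 1);
}

/* (n | !n) variant: !n is 1 when n == 0 and 0 otherwise, so the OR
   turns 0 into 1 and leaves every other value unchanged. Only zero
   is bumped: 0 -> 32, while 32 stays 32. */
static size_t round_up_nonzero(size_t n, size_t power)
{
    return ((n | (size_t)!n) + (power - 1)) & ~(power - 1);
}
```

Whether `!n` compiles to branch-free code depends on the compiler and target, so it is worth checking the generated assembly for the hot path in question.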

edited Feb 16 '15 at 13:35

answered Feb 16 '15 at 03:59

Doug Currie


This is close to what I want, but it means values that are already rounded get moved up to the next multiple. For example, when rounding to the nearest 32 it will round 32 to 64. – strcat Feb 16 '15 at 04:02
It's odd that you want zero, which is already rounded, to be incremented to the next power, but other values that are already rounded not incremented. – Doug Currie Feb 16 '15 at 04:04
If you're curious, the use case is inside a slab allocator where the allocation sizes need to be rounded up to the minimum alignment. Using `n | (!n)` is a clean way of eliminating zero as an edge case in general. GCC and Clang are lacking when it comes to optimizations like this... – strcat Feb 16 '15 at 04:19
Branchless source code may not end up branchless machine code. Also, branchless machine code may be slower than branching. I would say use clearly readable conditional source and let the compiler figure it out. You can help by using `__builtin_expect` if available. – Jester Feb 16 '15 at 12:47
Replacing the comparison and if statement with the boolean NOT and bitwise OR instruction is a significant performance win in this case. The branch (and comparison) is not very expensive because it almost always gets predicted correctly, but it's still *significantly more expensive* than this. Branch predictors don't have limitless resources and this is a hot inner code path. Neither GCC nor Clang optimizes the if statement to the same code. Using `__builtin_expect` doesn't really do much - it moves the slow paths out of the way for a clean fast path, but the code in this case is 1 instruction. – strcat Feb 19 '15 at 13:58
gcc compiles this to [2 instructions for x86](https://godbolt.org/g/XXFi9I), when `power` is a compile-time constant. It's similarly efficient for powerpc and ARM, as you can see on that godbolt link. When `power` isn't a compile-time constant, gcc would do better to use `xor edx,edx` / `bts edx, esi` to do `1< – Peter Cordes Jun 08 '16 at 20:19

score 1 · Answer 3 · edited Jun 08 '16 at 16:29

For a platform specific way in x86 assembly I'll add this one:

mov edx, num
mov eax, 1
xor ebx, ebx     ; EBX = 0 for use in CMOVZ
rep bsr ecx, edx ; get index of highest bit set - if num is 0 ECX would be undefined...  use faster LZCNT if available.
cmovz ecx, ebx   ; ...so set it to 0 if that's the case
shl eax, cl      ; get power of 2
cmp eax, edx     ; internally subtract num, which results in negative value (borrow/carry) except if it's already a power of 2 or zero
setc cl          ; if negative value(borrow/carry)...
shl eax, cl      ; ...then shift by one to next highest power
; EAX = result

Although another answer has already been accepted, this is a different way to do it.

edited Jun 08 '16 at 16:29

Johan


answered Feb 16 '15 at 09:25

zx485


This should actually work, on CPUs that support `lzcnt`, but it's a bit funky: `lzcnt` sets flags according to the result, not the input. When the input is zero, it produces `ecx = 32` (the operand-size), and sets `CF`. Since `shl` masks its count, this will result in a shift by `0` (no shift). `lzcnt` is only faster on AMD, but it's the same speed on Intel, so I guess the extra byte is worth using. However, did you consider using `eax=0` / `bts eax, ecx` instead of the shift? Then you don't need `ebx`; you use `eax` as your zeroed reg for `cmov` and `bts`. Also, bts is faster on SnB. – Peter Cordes Jun 08 '16 at 19:27
ecx=32, eax=0: `bts eax, ecx` -> eax=1. So this is safe with `lzcnt` instead of `bsr`. ([`bts r, i/r` takes the bit-position modulo operand-size](http://www.felixcloutier.com/x86/BTS.html)) – Peter Cordes Jun 08 '16 at 19:35
Actually, wait a minute, this rounds up to the **next** power of 2. The OP wants to round up to a **multiple of a *specific* power of 2**. e.g. to round an allocation size up from 257 to 272 (multiple of 16). – Peter Cordes Jun 08 '16 at 20:10

score 0 · Answer 4 · answered Feb 16 '15 at 04:02

If the range of input values is reasonably restricted, like 0..255, you could use a lookup table:

const unsigned char roundup_pow2 [] = {1, 2, 2, 2, 4, 4, 4, 4, // ...
};

unsigned int restricted_roundup_power2 (int v)
{
     // valid indices are 0 .. sizeof roundup_pow2 - 1
     if (v >= 0  &&  v < (int) sizeof roundup_pow2)
           return roundup_pow2 [v];
     return 0; // ???
}

The range could be extended reusing itself:

unsigned int roundup_power2 (int v)
{
     if (v < 0)
           return 0; // out of range; without this guard the recursion never ends
     if (v < (int) sizeof roundup_pow2)
           return roundup_pow2 [v];
     return 8 + roundup_power2 (v >> 8);
}

Of course, a simple program (left as an exercise) could be written to create the table values instead of figuring them out manually.
