For Intel x86 CPUs with BMI2, pext and pdep are fast. AMD before Zen 3 has very slow microcoded PEXT/PDEP (https://uops.info/), so be careful with this; other options might be faster on AMD, maybe even blsi in a loop, or better, a binary search on popcount (see below).
Only Intel has dedicated hardware execution units for the mask-controlled pack/unpack that pext/pdep do, making them constant-time: 1 uop, 3-cycle latency, and they can only run on port 1.
I'm not aware of other ISAs having a similar bit-packing hardware operation.
pdep basics: pdep(-1ULL, a) == a. Taking the low popcnt(a) bits from the first operand and depositing them at the places where a has set bits gives you a back again.

But if, instead of all-ones, your source of bits has the low N bits cleared, the first N set bits in a will grab a 0 instead of a 1. This is exactly what you want.
#include <immintrin.h>

uint64_t unset_first_n_bits_bmi2(uint64_t a, int n){
    // deposit ones above n zeros into the set-bit positions of a
    return _pdep_u64(-1ULL << n, a);
}
-1ULL << n works for n=0..63 in C. x86 asm scalar shift instructions mask their count (effectively &63), so that's probably what will happen for the C undefined behaviour of a larger n. If you care, use n&63 in the source so the behaviour is well-defined in C; it can still compile to a shift instruction that uses the count directly.
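For example, a version with the well-defined shift might look like this (an untested sketch; the _masked name is mine):

uint64_t unset_first_n_bits_masked(uint64_t a, int n){
    // n & 63 makes the shift well-defined C for any n; on x86 it can still
    // compile to a plain shlx, since the hardware masks the count anyway.
    return _pdep_u64(-1ULL << (n & 63), a);
}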
On Godbolt, I checked this against a simple looping reference implementation, showing that they produce the same result for a sample input a and n.
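Such a looping reference implementation might look something like this (my own sketch, not necessarily the exact one from the Godbolt link, using the classic x &= x - 1 idiom to clear the lowest set bit):

uint64_t unset_first_n_bits_ref(uint64_t a, int n){
    for (int i = 0; i < n; i++)
        a &= a - 1;        // clear the lowest set bit (like blsr)
    return a;
}

For example, both versions turn a = 0b10110110 with n = 3 into 0b10100000.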
GCC and clang both compile it the obvious way, as written:
# GCC10.2 -O3 -march=skylake
unset_first_n_bits_bmi2(unsigned long, int):
        mov     rax, -1
        shlx    rax, rax, rsi
        pdep    rax, rax, rdi
        ret
(SHLX is single-uop with 1-cycle latency, unlike legacy variable-count shifts, which have to leave FLAGS unmodified when CL=0 and are therefore more expensive on Intel.)
So this has 3-cycle latency from a -> output (just pdep) and 4-cycle latency from n -> output (shlx, then pdep). And it's only 3 uops for the front-end.
A semi-related BMI2 trick: pext(a,a) will pack the set bits at the bottom, like (1ULL << popcnt(a)) - 1 but without the shift-count overflow when all 64 bits are set. Clearing the low N bits of that with an AND mask and expanding with pdep would work. But that's an overcomplicated, expensive way to create a source of bits with enough ones above N zeros, which is all that actually matters for pdep. Thanks to @harold for spotting this in the first version of this answer.
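For reference, that over-complicated version would be something like this (untested sketch; equivalent to the simpler function above, just with redundant work):

uint64_t unset_first_n_bits_pext_pdep(uint64_t a, int n){
    uint64_t packed = _pext_u64(a, a);         // popcnt(a) ones at the bottom
    uint64_t cleared = packed & (-1ULL << n);  // clear the low n of those ones
    return _pdep_u64(cleared, a);              // expand back to a's bit positions
}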
Without fast PDEP: perhaps binary search for the right popcount
@Nate's suggestion of a binary search for how many low bits to clear is probably a good alternative to pdep.

Stop when popcount(x>>c) == popcount(x) - N to find out how many low bits to clear, preferably with branchless updating of c. (e.g. c = foo ? a : b often compiles to cmov.)
Once you're done searching, x & (-1ULL << c) uses that count, or just tmp << c to shift back the x>>c result you already have. Using the right shift directly is cheaper than generating a new mask and applying it every iteration.
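Here's a sketch of how that might look, assuming GCC/clang for __builtin_popcountll (my own untested implementation of the idea, not @Nate's exact code):

uint64_t unset_first_n_bits_popcnt(uint64_t x, int n){
    if (n <= 0) return x;
    int target = __builtin_popcountll(x) - n;
    if (target <= 0) return 0;        // n >= popcount(x): everything gets cleared
    // Branchless binary search for the largest c where popcount(x>>c) > target;
    // c+1 is then the smallest shift count that drops exactly n set bits.
    int c = 0;
    for (int step = 32; step; step >>= 1)
        c = (__builtin_popcountll(x >> (c + step)) > target) ? c + step : c;  // hopefully cmov
    c += 1;
    return (x >> c) << c;             // shift back instead of building a mask
}

The loop is a fixed six iterations, so a compiler can fully unroll it, but each iteration depends on the previous c: the latency chain is six rounds of shift/popcnt/cmp/cmov, which is the bottleneck the tuning ideas below could try to shorten.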
High-performance popcount is relatively widely available on modern CPUs. (Although it's not baseline for x86-64; you still need to compile with -mpopcnt or -march=native.)
Tuning this could involve choosing a likely starting point, and perhaps using a max initial step size instead of pure binary search. Getting some instruction-level parallelism out of trying several initial guesses could help shorten the latency bottleneck.