
There is a relatively well-known trick for unsetting the single right-most set bit:

y = x & (x - 1) // 0b001011100 & 0b001011011 = 0b001011000 :)

I'm finding myself with a tight loop to clear the n right-most set bits, but is there a simpler algebraic trick?

Assume relatively large n (n has to be < 64 for 64-bit integers, but it's often on the order of 20-30).

// x = 0b001011100 n=2
for (auto i=0; i<n; i++) x &= x - 1;
// x = 0b001010000

I've thumbed through my TAOCP Vol 4A a few times, but can't find any inspiration.

Maybe there is some hardware support for it?

qdot
  • Any particular ISA you care about for HW support? I think x86 `pext` / `pdep` to make the set bits contiguous can work, to allow clearing them with AND. – Peter Cordes Jan 20 '21 at 20:55
  • Interesting - I looked at `pext`/`pdep` briefly, but it appears I'd need to compute the mask in advance, right? I can't guarantee that the n bits are continuous in the input variable. – qdot Jan 20 '21 at 21:02
  • I haven't tested this yet, but I think `pext(a,a)` will pack the bits at the bottom: the desired mask of which bits to select *is* the input number, because you want all the set bits and none of the clear bits. – Peter Cordes Jan 20 '21 at 21:06
  • Random idea: mask off some arbitrary number of bits (say `(64+n)/2`), use popcount to see how many you cleared, and binary search until you get it right. Should take at most 6 iterations, but the unpredictable branches might be a killer, unless there is a clever branchless approach. – Nate Eldredge Jan 20 '21 at 21:43
  • That's a good idea on hardware that supports `popcount` but not `pdep`. I'm really happy (myself) with @PeterCordes answer - it works, and I was actually able to optimize more places once I read BMI2 manual; I'm still curious if there is a way to accelerate it on a more restricted hardware! I'll accept it in a few days, in the meantime. lets solicit non-BMI2 alternatives! – qdot Jan 22 '21 at 00:56
  • @NateEldredge: In general you can make binary search branchless. When searching memory you often don't want to (speculative execution trigger HW prefetch, and cache misses cost more than branch misses). But here yeah you could just have something that compiles to `cmov` with the only branch being the while `popcnt(result) == popcnt(input)-n` loop branch. – Peter Cordes Jan 22 '21 at 01:06
  • @NateEldredge: Added that idea to my answer as a fallback. Since popcount doesn't care *where* the bits are, it's often more efficient to shift the bit than to create a mask and then AND it, but the basic idea is solid and may be faster than PDEP on AMD. – Peter Cordes Jan 22 '21 at 05:10
  • related: [Efficiently finding the position of the k'th set bit in a bitset](https://stackoverflow.com/q/28485961/995714) – phuclv Mar 26 '21 at 10:05

1 Answer


For Intel x86 CPUs with BMI2, pext and pdep are fast. AMD before Zen 3 implements PEXT/PDEP as very slow microcode (https://uops.info/), so be careful with this; other options might be faster on AMD, maybe even blsi in a loop, or better, a binary search on popcount (see below).
Intel has dedicated hardware execution units for the mask-controlled pack/unpack that pext/pdep do, making them constant-time: 1 uop, 3-cycle latency, can only run on port 1.

I'm not aware of other ISAs having a similar bit-packing hardware operation.


pdep basics: pdep(-1ULL, a) == a. Taking the low popcnt(a) bits from the first operand, and depositing them at the places where a has set bits, will give you a back again.

But if, instead of all-ones, your source of bits has the low N bits cleared, the first N set bits in a will grab a 0 instead of 1. This is exactly what you want.

uint64_t unset_first_n_bits_bmi2(uint64_t a, int n){
    return _pdep_u64(-1ULL << n, a);
}

-1ULL << n works for n=0..63 in C. x86 asm scalar shift instructions mask their count (effectively &63), so that's probably what will happen for the C undefined-behaviour of a larger n. If you care, use n&63 in the source so the behaviour is well-defined in C, and it can still compile to a shift instruction that uses the count directly.
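
For example, the masked-count variant could look like this (a minimal sketch; the `_safe` suffix is just my naming):

#include <immintrin.h>
#include <stdint.h>

// Same idea, but with the shift count masked so n >= 64 is also
// well-defined in the C source, not just in the eventual x86 asm.
uint64_t unset_first_n_bits_bmi2_safe(uint64_t a, unsigned n){
    return _pdep_u64(-1ULL << (n & 63), a);
}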

On Godbolt, unset_first_n_bits_bmi2 is compared against a simple looping reference implementation, showing that they produce the same result for a sample input a and n.

GCC and clang both compile it the obvious way, as written:

# GCC10.2 -O3 -march=skylake
unset_first_n_bits_bmi2(unsigned long, int):
        mov     rax, -1
        shlx    rax, rax, rsi
        pdep    rax, rax, rdi
        ret

(SHLX is a single uop with 1-cycle latency, unlike legacy variable-count shifts, which update FLAGS except when CL=0 and are therefore slower.)

So this has 3-cycle latency from a -> output (just pdep), 4-cycle latency from n -> output (shlx, then pdep), and is only 3 uops for the front-end.


A semi-related BMI2 trick:

pext(a,a) will pack the bits at the bottom, like (1ULL<<popcnt(a)) - 1 but without overflow if all bits are set.

Clearing the low N bits of that with an AND mask, and expanding with pdep would work. But that's an overcomplicated expensive way to create a source of bits with enough ones above N zeros, which is all that actually matters for pdep. Thanks to @harold for spotting this in the first version of this answer.
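
For illustration only (my own function name; not recommended, for the reasons above), that round trip would look something like:

#include <immintrin.h>
#include <stdint.h>

// pext/pdep round trip: pack the set bits to the bottom, clear the low n
// of them, then scatter the survivors back to their original positions.
uint64_t unset_first_n_bits_pext_pdep(uint64_t a, int n){
    uint64_t packed = _pext_u64(a, a);   // popcnt(a) ones in the low bits
    packed &= -1ULL << (n & 63);         // clear the low n of those ones
    return _pdep_u64(packed, a);         // deposit back where a has set bits
}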


Without fast PDEP: perhaps binary search for the right popcount

@Nate's suggestion of a binary search for how many low bits to clear is probably a good alternative to pdep.

Stop when popcount(x>>c) == popcount(x) - N to find out how many low bits to clear, preferably with branchless updating of c. (e.g. c = foo ? a : b often compiles to cmov).

Once you're done searching, x & (-1ULL<<c) uses that count, or just tmp << c to shift back the x>>c result you already have. Using right-shift directly is cheaper than generating a new mask and using it every iteration.

High-performance popcount is relatively widely available on modern CPUs. (Although not baseline for x86-64; you still need to compile with -mpopcnt or -march=native).

Tuning this could involve choosing a likely starting-point, and perhaps using a max initial step size instead of pure binary search. Getting some instruction-level parallelism out of trying some initial guesses could perhaps help shorten the latency bottleneck.
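
A rough sketch of that fallback (untested; the function name, the plain fixed-range binary search, and the use of C++20 std::popcount are my choices — substitute __builtin_popcountll or an intrinsic as needed):

#include <bit>
#include <cstdint>

// Binary-search for the smallest shift count c such that
// popcount(x >> c) == popcount(x) - n, i.e. exactly n set bits sit below bit c,
// then shift those low bits out and back in to clear them.
uint64_t unset_first_n_bits_popcnt(uint64_t x, int n){
    int target = std::popcount(x) - n;
    if (target <= 0) return 0;              // n >= popcount(x): clear everything
    unsigned lo = 0, hi = 64;               // search range for the shift count c
    while (lo < hi) {
        unsigned mid = (lo + hi) / 2;
        // Ideally this compiles to cmov so the loop body is branchless.
        if (std::popcount(x >> mid) <= target)
            hi = mid;
        else
            lo = mid + 1;
    }
    return (x >> lo) << lo;                 // clear the low `lo` bits
}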

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Couldn't this be done with `pdep(-1ULL << n, a)`? – harold Jan 20 '21 at 21:45
  • @harold: Updated, thanks. I was thinking of the problem like AVX512 `vpcompressd` where each input element has an identity, but a 1 bit is just a 1 bit, and doesn't have to come from the original input. – Peter Cordes Jan 20 '21 at 22:21