I have a mask with a small number of set bits, just 3 or 4 of them. The mask can be up to 64 bits wide, but let's take a short example: 10100101. I'd like to generate masks that each end at a set bit and include the lower bits down to, but not including, the previous set bit:

00000001
00000110
00111000
11000000

I can do that in a loop by isolating the lowest set bit and adding the bits to its right, `((x & -x) << 1) - 1`, and then removing the previous mask using XOR.
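
A minimal sketch of that loop, for reference. The function name `range_masks_loop` and the output-array convention are assumptions made for illustration, not part of any existing code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes one range mask per set bit of x into out[]
   (lowest range first) and returns how many masks were written. */
size_t range_masks_loop(uint64_t x, uint64_t out[64])
{
    assert(out != NULL);
    size_t n = 0;
    uint64_t prev = 0;                   /* cumulative mask already covered */
    while (x != 0) {
        uint64_t low  = x & -x;          /* isolate the lowest set bit */
        uint64_t upto = (low << 1) - 1;  /* that bit plus every bit below it */
        out[n++] = upto ^ prev;          /* remove the previous cumulative mask */
        prev = upto;
        x ^= low;                        /* clear the bit just handled */
    }
    return n;
}
```

For the example input 10100101 (0xA5) this should produce the four masks listed above, lowest first.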

The question is: can it be done more efficiently, in parallel and without looping, using some SWAR or SIMD technique?

BitWhistler
  • This appears to have variable-length output, depending on how many set bits there are. I guess that's something you can get with `popcnt`, but to do it in parallel, you'd need something that could get the 3rd bit-range in one element, the 4th in another, or something. IDK, maybe something better than serial is possible, but you might need something like `pext` to clear the lowest `n` set bits and leave the range you want where `(x&-x) - x` can get it (or something; I don't think that expression actually works). And there isn't parallel `pext`, so `blsr` might be just as fast to iterate set bits – Peter Cordes Mar 21 '23 at 01:23
  • Oh, you say you have a small fixed upper bound on the number of set bits? So a long dep chain isn't a big problem, just 2x or 3x `blsr` to expose the highest set bit. That's 1 uop / 1c latency on Intel and Zen 4, vs. 2 uops on AMD before Zen 4. https://uops.info/ – Peter Cordes Mar 21 '23 at 01:26
  • The instructions from the BMI1 and BMI2 instruction sets might be helpful here. – fuz Mar 21 '23 at 02:20
  • Where do you want the masks? In the elements of an XMM or YMM register? In an integer register one after another in a loop? In separate integer registers, hopefully with some ILP? – Peter Cordes Mar 21 '23 at 16:40
  • Sorry for the late reply. I'd like to have the results in a register so I can use it to extract these bits into separate ints – BitWhistler Apr 02 '23 at 16:34
  • Just noticed your mention of bsr, @PeterCordes. Thanks! For now I have a loop around the input so I guess I'll stay with it for now. The masked integer is the result of grabbing the high bits that I'm currently iterating... – BitWhistler Apr 02 '23 at 16:41
  • BMI1 [`blsr` (bit lowest-set reset)](https://www.felixcloutier.com/x86/blsr), not 386 `bsr` (bit-scan reverse). BLSR is `(x-1)&x` as a single instruction; compilers will optimize this for you if compiling with BMI1 available. I don't think it helps much to find the index of the highest set bit (`bsr`) separately, since that would leave a bunch of work to do to check if the bit below is contiguous. – Peter Cordes Apr 02 '23 at 18:44
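
A rough sketch of the unrolled idea from the comments above, assuming the input never has more than 4 set bits. The function name `range_masks_unrolled4` and the fixed 4-element output are hypothetical choices for illustration. It is written with plain C expressions rather than intrinsics; as noted in the comments, a compiler with BMI1 enabled can turn `x & (x - 1)` into `blsr`, and `x ^ (x - 1)` is the operation `blsmsk` performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, assuming x has at most 4 set bits.
   out[k] gets the range mask ending at the k-th lowest set bit,
   or 0 if x has fewer than k+1 set bits. */
void range_masks_unrolled4(uint64_t x, uint64_t out[4])
{
    assert(out != NULL);

    /* v & (v - 1) clears the lowest set bit: a short serial chain. */
    uint64_t x0 = x;
    uint64_t x1 = x0 & (x0 - 1);
    uint64_t x2 = x1 & (x1 - 1);
    uint64_t x3 = x2 & (x2 - 1);

    /* v ^ (v - 1) sets every bit up to and including the lowest set bit.
       These four are independent of each other once x0..x3 are known. */
    uint64_t c0 = x0 ^ (x0 - 1);
    uint64_t c1 = x1 ^ (x1 - 1);
    uint64_t c2 = x2 ^ (x2 - 1);
    uint64_t c3 = x3 ^ (x3 - 1);

    /* XOR adjacent cumulative masks; force 0 where no set bit remains
       (for v == 0 the cumulative mask would otherwise be all ones). */
    out[0] = x0 ? c0      : 0;
    out[1] = x1 ? c1 ^ c0 : 0;
    out[2] = x2 ? c2 ^ c1 : 0;
    out[3] = x3 ? c3 ^ c2 : 0;
}
```

This is still scalar, but only the bit-clearing steps form a serial chain; whether it actually beats the loop would need measuring on the target CPU.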

1 Answer

Here is a loop you can use:

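   #include <stdio.h>
   #include <stdint.h>
   #include <inttypes.h>
   int main(void) {
     uint64_t val = 0xa5; // input value (masks can be up to 64 bits wide)
     uint64_t sum = 0, mask; // sum holds the previous cumulative mask
     while(val != 0) {
         mask = (val - 1) ^ val; // all bits up to and including the lowest set bit
         printf("Mask-out is: %" PRIx64 "\n", mask ^ sum); // drop bits already reported
         sum  =  mask;
         val &= ~mask; // clear the set bit just processed
     }
     return 0;
   }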
olegarch
  • [please remove line numbers](https://meta.stackoverflow.com/q/252559/995714) – phuclv Mar 21 '23 at 09:03
  • Isn't this just an implementation of the serial scalar algorithm the OP described in the question? The question is whether something more efficient is possible. Or it's an optimization of that, using a more efficient bithack? – Peter Cordes Mar 21 '23 at 11:11
  • I did not see the original "serial scalar" implementation you mentioned, so I'm not sure. But the original description uses `((x & -x) << 1)`, while my implementation does not, so this is definitely a different algorithm. How efficient it is compared to the original, I do not know. – olegarch Mar 21 '23 at 16:32
  • This isn't SIMD (so it's scalar), and each `val` depends on the previous iteration's `val` (so it's serial), with a dependency chain involving `sub` or `lea`, `xor` and `andn` instructions. Or `(val-1) ^ val` can compile into `blsmsk` (https://www.felixcloutier.com/x86/blsmsk), but then it's still a chain of 2 operations so not ideal for latency. Still, the total number of operations might be better than the OP's `((x & -x) << 1) - 1`, although actually that's just `blsi` and `lea reg, [reg*2 - 1]`. But then they propose getting the next iteration with a XOR, so 3 ops, at best 3 cycle latency – Peter Cordes Mar 21 '23 at 16:46
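
To make the comparison in the last comment concrete, here is a small sketch showing that the two bithacks compute the same cumulative mask and differ only in how many operations they take. The function names are hypothetical, chosen after the instructions each expression maps to:

```c
#include <assert.h>
#include <stdint.h>

/* The answer's form: one operation when it compiles to blsmsk. */
uint64_t upto_via_blsmsk(uint64_t x)
{
    return x ^ (x - 1);
}

/* The question's form: isolate the lowest set bit (blsi), then 2*low - 1. */
uint64_t upto_via_blsi(uint64_t x)
{
    uint64_t low = x & -x;
    return (low << 1) - 1;
}

/* Sanity check that both forms agree for a given input. */
void check_equivalence(uint64_t x)
{
    assert(upto_via_blsmsk(x) == upto_via_blsi(x));
}
```

Both return all bits up to and including the lowest set bit of `x`, so the choice between them only affects instruction count and latency, not the result.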