
I'm developing an algorithm that uses __builtin_ffsll() with uint64_t type.

I want to switch to a 512-bit field using the Boost.Multiprecision library (I'm running on a machine with AVX-512 support).

Is there a function similar to that builtin? Alternatively, how can I efficiently implement this functionality for 512-bit integers?

Purple

1 Answer


From the documentation:

unsigned lsb(const number-or-expression-template-type& x);

Returns the (zero-based) index of the least significant bit that is set to 1.

Throws a std::range_error if the argument is <= 0.

ffs() is one-based, so adding 1 to lsb()'s return value makes the two equivalent for nonzero input. As pointed out in the comments, zero has to be handled separately: ffs() returns 0 for it, while lsb() throws.

Maybe something like

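#include <boost/multiprecision/cpp_int.hpp>

// One-based index of the lowest set bit, or 0 if n is zero (like ffs()).
unsigned ffs512(const boost::multiprecision::uint512_t &n) {
  if (n.is_zero()) {
    return 0;  // lsb() would throw std::range_error here
  } else {
    return boost::multiprecision::lsb(n) + 1;
  }
}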
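As a quick sanity check of the off-by-one relationship, the same wrapper pattern can be tried on a plain 64-bit value, where GCC and Clang offer both a zero-based trailing-zero count (__builtin_ctzll) and the one-based __builtin_ffsll. This is only a sketch: the helper name ffs64_via_ctz is made up for illustration, and it assumes a GCC-compatible compiler.

```cpp
#include <cstdint>

// Hypothetical helper: a one-based "find first set" built from a zero-based
// trailing-zero count, mirroring the lsb() + 1 wrapper above.
inline unsigned ffs64_via_ctz(std::uint64_t x) {
    if (x == 0) {
        return 0;  // ffs convention: 0 means "no bit set"
    }
    // __builtin_ctzll is undefined for 0, hence the guard above.
    return static_cast<unsigned>(__builtin_ctzll(x)) + 1u;
}
```

For any input, this should agree with __builtin_ffsll, which is the behaviour the 512-bit wrapper is meant to reproduce.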
Shawn
  • Also note the different behaviour for input == 0. It appears `lsb` is designed to wrap things like x86's `bsf` instruction which only sets a flag when the input is zero; it doesn't write a useful value to the output register. But `ffs` does handle the input==0 case, returning `0`. – Peter Cordes Oct 03 '19 at 03:15
  • Does `lsb` on a 512-bit integer actually compile to an AVX512 sequence that uses `vptestmd` / `vpcompressd` or similar to find the first non-zero dword, then `vmovd` to extract it to an integer register for `tzcnt`? (Or using 2x 256-bit vectors to avoid putting the CPU into 512-bit SIMD mode.) The OP wants this to be efficient on x86, presumably with gcc and/or clang with `-O3 -march=skylake-avx512` – Peter Cordes Oct 03 '19 at 03:21
  • @PeterCordes I imagine the GMP backend calls `mpz_scan1()`. No clue about the others. – Shawn Oct 03 '19 at 05:03
  • Godbolt doesn't have the GMP back-end installed :/ With the portable CPP back-end, gcc and clang use qword loops to check for a non-zero chunk, then `tzcnt` it. https://godbolt.org/z/qEq_QE. GCC still emits exception-creating functions but `ffs512` does optimize away any potential call to them. AVX2 to just find the right dword or qword index for a scalar load would also have been good, vs. the AVX512 idea I suggested earlier. `mpz_scan1()` is just written in C (https://gmplib.org/repo/gmp/file/tip/mpz/scan1.c), which uses the same limb search (at `short_cut:`) then trailing-zero bitscan. – Peter Cordes Oct 03 '19 at 06:02
  • @PeterCordes I think GMP only has the generic version of mp[nz]_scan1 because it was never the limiting factor, compared to multiplications or worse divisions and gcd. If you think it is useful and you can provide a faster version for large scans that does not significantly slow down the common case where the scan finds a 1 bit almost immediately, you could try contributing it to GMP. But indeed the OP has a simpler case to handle, always scanning from the beginning. And the current `lsb` is likely fast enough for them. – Marc Glisse Oct 03 '19 at 06:27
  • @MarcGlisse: yeah, the GMP version has to be much more flexible, with the size not being a compile-time constant. I think we could detect large limb-count with a single extra check that might be negligible for small numbers. But if the first `1` is in the first limb, then yes pure scalar is still much better (also avoids any risk of a store-forwarding stall from a wide load). So maybe you'd want to start with that and only look at the size for maybe doing a SIMD search if not found in the first limb. If not found in the first 64 bits, it's probably not in the next 64 either for large n. – Peter Cordes Oct 03 '19 at 06:40
  • Looks like this `uint512_t` is only typedefed for the `cpp_int` backend so it doesn't really matter for this question how the gmp backend implements it. – Shawn Oct 03 '19 at 06:40
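
The approach described in the comments above (loop over 64-bit chunks until a non-zero one is found, then do a trailing-zero bit scan on it) can be sketched by hand roughly as follows. This is an illustration under stated assumptions, not Boost's or GMP's actual code: the type Limbs512 and the function ffs512_limbs are hypothetical names, the value is assumed to be stored as eight 64-bit limbs with the least significant limb first, and __builtin_ctzll assumes GCC or Clang.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 512-bit value: eight 64-bit limbs,
// least significant limb first. This is not Boost's internal representation.
using Limbs512 = std::array<std::uint64_t, 8>;

// One-based index of the lowest set bit, or 0 if every limb is zero.
inline unsigned ffs512_limbs(const Limbs512& limbs) {
    for (std::size_t i = 0; i < limbs.size(); ++i) {
        if (limbs[i] != 0) {
            // Zero-based bit position inside the first non-zero limb;
            // compilers typically lower this to tzcnt or bsf on x86.
            unsigned bit = static_cast<unsigned>(__builtin_ctzll(limbs[i]));
            return static_cast<unsigned>(i) * 64u + bit + 1u;
        }
    }
    return 0;  // no bit set, matching the ffs convention
}
```

When the lowest set bit is in the first limb, the loop exits on its first iteration, which is the common case that the comments argue should stay purely scalar.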