
I am new to the AVX2 and SSE2 instruction sets, and I want to learn more about how to use such instruction sets to speed up bit vector operations.

So far I have used them successfully to vectorize code with double / float operations.

In this example, I have some C++ code that checks a condition before setting (or not) a bit in a bit vector (using unsigned int) to a specific value:

int process_bit_vector(unsigned int *bitVector, float *value, const float threshold, const unsigned int dim)
{
    int sum = 0, cond = 0;

    for (unsigned int i = 0; i < dim; i++) {
        unsigned int *word = bitVector + i / 32;
        unsigned int bitValue = ((unsigned int)0x80000000 >> (i & 0x1f));
        cond = (value[i] <= threshold);
        (*word) = (cond) ? (*word) | bitValue : (*word);
        sum += cond;
    }

    return sum;
}

The returned sum is just the number of cases where the condition is TRUE.

I tried to rewrite this routine with SSE2 and AVX2 but it didn't work out... :-(

Is it possible to rewrite such C++ code using AVX2 and SSE2? Is it worth using vectorization for this type of bit operation? The bit vector could contain many thousands of bits, so I hope SSE2 and AVX2 could provide a worthwhile speed-up.

Thanks in advance!

Liotro78
  • You're going to want `_mm_movemask_ps` and scalar `|=` on a 32-bit chunk of mask data. And popcnt that. I think your bit-indexing is backwards, reversed within each `unsigned int` from the order you're reading `float value[]` but I assume that's unintentional. – Peter Cordes Nov 04 '19 at 13:19
  • Are you sure that `bitValue` is a pointer, instead of an `unsigned int`? – chtz Nov 04 '19 at 13:22
  • It is a typo, you are right it is not a pointer, I am going to fix the code example. – Liotro78 Nov 04 '19 at 13:23
  • @PeterCordes first of all, thanks for your comments! Can you give me some example code? I am still a bit confused about how to combine all the pieces together... Thanks in advance. – Liotro78 Nov 04 '19 at 13:48
  • Can you clarify Peter's question about bit-order? Do you really want `bitVector` to be big-endian (per int32)? Also, does `bitVector` contain non-zero information before calling your function? – chtz Nov 06 '19 at 12:39
  • Hello @chtz, 1) No, it should not be big-endian, that was unintentional. 2) Yes, it can contain non-zero values before calling the function. Regards. – Liotro78 Nov 07 '19 at 13:17

1 Answer


The following should work if dim is a multiple of 8 (to handle the remainder, add a trivial scalar loop at the end; a sketch of such a tail loop follows the list below). Minor API changes:

  • Use long instead of unsigned int for loop indices (this helps clang unroll the loop)
  • Assume the bit vector is little-endian (as suggested in the comments)
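
For illustration, a scalar tail loop handling the remaining dim % 8 elements could look like this (just a sketch, untested; it reuses the variable names of the AVX2 version below and assumes tail_sum gets added to the reduced sum before returning):

// hypothetical remainder handling, same little-endian bit layout as the main loop
int tail_sum = 0;
for (long i = dim - (dim % 8); i < dim; ++i) {
  if (value[i] <= threshold_float) {
    bitVector32[i / 32] |= 1u << (i % 32);
    tail_sum++;
  }
}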

Inside the loop, bitVector is accessed byte-wise. It might be worth combining 2 or 4 movemask results and bit-or-ing them at once (this probably depends on the target architecture).

To calculate the sum, 8 partial sums are calculated directly from the result of the cmp_ps operation. Since you need the bitmask anyway, it may be worth using popcnt instead (ideally after combining 2, 4, or 8 bytes together -- again, this probably depends on your target architecture); a sketch of that idea is shown after the code and notes below.

#include <immintrin.h>
#include <cstdint>

int process_bit_vector(uint32_t *bitVector32, float *value,
                       const float threshold_float, const long dim) {
  __m256i sum = _mm256_setzero_si256();
  __m256 threshold_vector = _mm256_set1_ps(threshold_float);
  uint8_t *bitVector8 = (uint8_t *)bitVector32;

  for (long i = 0; i <= dim-8; i += 8) {
    // compare next 8 values with threshold
    // (use threshold as first operand to allow loading other operand from memory)
    __m256 cmp_mask = _mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i), _CMP_GE_OQ);
    // true values are `-1` when interpreted as integers, subtract those from `sum`
    sum = _mm256_sub_epi32(sum, _mm256_castps_si256(cmp_mask));
    // extract bitmask
    int mask = _mm256_movemask_ps(cmp_mask);
    // bitwise-or current mask with result bit-vector
    *bitVector8++ |= mask;
  }

  // reduce 8 partial sums to a single sum and return
  __m128i sum_reduced = _mm_add_epi32(_mm256_castsi256_si128(sum), _mm256_extracti128_si256(sum,1));
  sum_reduced = _mm_add_epi32(sum_reduced, _mm_srli_si128(sum_reduced, 8));
  sum_reduced = _mm_add_epi32(sum_reduced, _mm_srli_si128(sum_reduced, 4));

  return _mm_cvtsi128_si32(sum_reduced);
}

Godbolt-Link: https://godbolt.org/z/ABwDPe

  • For some reason GCC does vpsubd ymm2, ymm0, ymm1; vmovdqa ymm0, ymm2; instead of just vpsubd ymm0, ymm0, ymm1.
  • Clang fails to join the load with the vcmpps (and uses LE instead of GE comparison) -- if you don't care about how NaNs are handled, you could use _CMP_NLT_US instead of _CMP_GE_OQ.
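
As mentioned above, combining several movemask results before touching memory might pay off. A rough sketch of that idea (untested; assumes dim is a multiple of 32, keeps the little-endian bit layout, and reuses the variables of the function above, with a scalar sum and popcnt replacing the vector accumulation):

int sum = 0;
for (long i = 0; i <= dim - 32; i += 32) {
  uint32_t mask = 0;
  for (int k = 0; k < 4; ++k) {
    __m256 cmp = _mm256_cmp_ps(threshold_vector,
                               _mm256_loadu_ps(value + i + 8 * k), _CMP_GE_OQ);
    // each movemask gives 8 bits; shift them into their position within the 32-bit word
    mask |= (uint32_t)_mm256_movemask_ps(cmp) << (8 * k);
  }
  sum += _mm_popcnt_u32(mask);  // count the true conditions
  bitVector32[i / 32] |= mask;  // merge into the existing bit vector
}

Whether this beats the per-vector vpaddd/or version probably depends on the target; see also the discussion in the comments below.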

Revised version with big-endian output (untested):

int process_bit_vector(uint32_t *bitVector32, float *value,
                       const float threshold_float, const long dim) {
  int sum = 0;
  __m256 threshold_vector = _mm256_set1_ps(threshold_float);

  for (long i = 0; i <= dim-32; i += 32) {
    // compare next 4x8 values with threshold
    // (use threshold as first operand to allow loading other operand from memory)
    __m256i cmp_maskA = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i+ 0), _CMP_GE_OQ));
    __m256i cmp_maskB = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i+ 8), _CMP_GE_OQ));
    __m256i cmp_maskC = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i+16), _CMP_GE_OQ));
    __m256i cmp_maskD = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, _mm256_loadu_ps(value + i+24), _CMP_GE_OQ));

    __m256i cmp_mask = _mm256_packs_epi16(
        _mm256_packs_epi16(cmp_maskA,cmp_maskB), // b7b7b6b6'b5b5b4b4'a7a7a6a6'a5a5a4a4 b3b3b2b2'b1b1b0b0'a3a3a2a2'a1a1a0a0
        _mm256_packs_epi16(cmp_maskC,cmp_maskD)  // d7d7d6d6'd5d5d4d4'c7c7c6c6'c5c5c4c4 d3d3d2d2'd1d1d0d0'c3c3c2c2'c1c1c0c0
    );                                // cmp_mask = d7d6d5d4'c7c6c5c4'b7b6b5b4'a7a6a5a4 d3d2d1d0'c3c2c1c0'b3b2b1b0'a3a2a1a0

    cmp_mask = _mm256_permute4x64_epi64(cmp_mask, 0x8d);
                // cmp_mask = [b7b6b5b4'a7a6a5a4 b3b2b1b0'a3a2a1a0  d7d6d5d4'c7c6c5c4 d3d2d1d0'c3c2c1c0]
    __m256i shuff_idx = _mm256_broadcastsi128_si256(_mm_set_epi64x(0x00010203'08090a0b,0x04050607'0c0d0e0f));
    cmp_mask = _mm256_shuffle_epi8(cmp_mask, shuff_idx);

    // extract bitmask
    uint32_t mask = _mm256_movemask_epi8(cmp_mask);
    sum += _mm_popcnt_u32 (mask);
    // bitwise-or current mask with result bit-vector
    *bitVector32++ |= mask;
  }

  return sum;
}

The idea is to shuffle the bytes before applying a vpmovmskb on them. This takes 5 shuffle operations (including the 3 vpacksswb) for 32 input values, but the computation of the sum is done using a popcnt instead of 4 vpsubd. The vpermq (_mm256_permute4x64_epi64) could probably be avoided by strategically loading 128-bit halves into 256-bit vectors before comparing them. Another idea (since you need to shuffle the final result anyway) would be to blend together partial results (this tends to require p5 or 2*p015 on the architectures I checked, so probably not worth it).
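
To illustrate the last point: the following sketch (untested; it reuses the variables of the function above) loads 128-bit halves so that the two pack steps already leave the bytes in input order, avoiding the vpermq. Note that this yields the little-endian bit layout (bit k of each 32-bit word corresponds to value[i+k]); producing the big-endian layout of the code above would still require additional byte shuffling.

for (long i = 0; i <= dim - 32; i += 32) {
  // put values 0-3/16-19, 4-7/20-23, 8-11/24-27, 12-15/28-31 into the low/high lanes
  __m256 a = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(value + i +  0)), _mm_loadu_ps(value + i + 16), 1);
  __m256 b = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(value + i +  4)), _mm_loadu_ps(value + i + 20), 1);
  __m256 c = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(value + i +  8)), _mm_loadu_ps(value + i + 24), 1);
  __m256 d = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(value + i + 12)), _mm_loadu_ps(value + i + 28), 1);

  __m256i mA = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, a, _CMP_GE_OQ));
  __m256i mB = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, b, _CMP_GE_OQ));
  __m256i mC = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, c, _CMP_GE_OQ));
  __m256i mD = _mm256_castps_si256(_mm256_cmp_ps(threshold_vector, d, _CMP_GE_OQ));

  // after the two pack steps the bytes are already ordered v0..v31, no lane fixup needed
  __m256i packed = _mm256_packs_epi16(_mm256_packs_epi16(mA, mB),
                                      _mm256_packs_epi16(mC, mD));

  uint32_t mask = (uint32_t)_mm256_movemask_epi8(packed);
  sum += _mm_popcnt_u32(mask);
  *bitVector32++ |= mask;
}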

chtz
  • You could consider packing compare result vectors into 32-bit chunks with 2x `vpackssdw` + `vpacksswb` + vpermq lane-crossing fixup to feed `vpmovmskb`, or use scalar operations. I highly dislike `foo` and `foo_` being used in the same function; there are so many other underscores flying around from intrinsics that it's very easy to miss when reading the code. (I started writing this comment because I thought you had a bug working with `uint32_t *bitVector`. But no, your `bitVector` isn't the function arg.) – Peter Cordes Nov 07 '19 at 14:48
  • One small upside to combining up to 32-bit is that you could scalar popcnt / add once per 4 vectors, vs. `vpaddd` once per vector. Not a big deal, and nice idea to just count on the fly assuming that the vector is less than 2^32 * 8 floats long. – Peter Cordes Nov 07 '19 at 14:50
  • Good change with the var names; 8 vs. 32 in the name highlights to the reader that you're accessing the same thing different ways. – Peter Cordes Nov 07 '19 at 15:00
  • @PeterCordes I was considering combining 2 or 4 `vmovmskps` results, but could not figure out anything which actually increased throughput. I did not think about `2*vpackssdw`+`vpacksswb`+`vpermq`. I assume that would be better (costing `4*p5` for 8 vectors, but saving `7*p0` from fewer `movmsk` operations; and in that case, `popcnt` is also likely better). I won't update this answer, feel free to write an improved version. – chtz Nov 07 '19 at 15:05
  • @chtz, thanks a lot for your code! I will give it a try and let you know! – Liotro78 Nov 07 '19 at 15:19
  • @chtz: yeah, not sure if it's a real win or not on SKL, Ice Lake, or Ryzen. Also it would need AVX2 where this only needs AVX1 (except `add_epi32`). It probably would be with AVX512 using 2x [`kunpckbw` + `kunpckwd`](https://www.felixcloutier.com/x86/kunpckbw:kunpckwd:kunpckdq) without any "in-lane" vs. "lane-crossing" problems. – Peter Cordes Nov 07 '19 at 15:32
  • @PeterCordes Indeed, I just noticed that your suggested shuffling would only merge 4 vectors not 8. So it would bottleneck on `p5` (with still 1cycle per 8 floats). _Might_ be worth it, if the saved `p016` cycles are used otherwise, e.g., for loop control. – chtz Nov 07 '19 at 15:51
  • You're assuming yours doesn't bottleneck on front-end throughput. Memory-destination `or` is 2 uops whether it's a byte or a dword, and until Ice Lake or Zen the front-end is only 4-wide. (Although merging costs 1 uop per vector :/). Oh, I just noticed in the question they want to *set or not a bit in a bit vector (using unsigned int) to a specific value* so I'm not sure if they want to merge into the old value or not. Vector merging would have a better chance of being good if you couldn't use AVX and were doing 4 floats per vector: avoids nibble handling and no lane-crossing => 3 shuffles. – Peter Cordes Nov 07 '19 at 16:16
  • vcmpps + vmovmskps + vpaddd (1 each, assuming micro-fusion of the load + vcmpps which means avoiding an indexed addressing mode) + `or [mem],reg` (2) is already 5 uops per input vector / per output byte with no room for loop control, even when unrolling for icelake (5-wide issue/rename). I think merging comes out slightly ahead of that if you need to `or` into memory instead of just storing. – Peter Cordes Nov 07 '19 at 16:20
  • Regarding merge or not, quoting Liotro78: "2) Yes, it can contain non-zero values before calling the function." (i.e., `or` is necessary at some point). If I counted correctly, and understand [this](https://www.uops.info/html-instr/OR_NOREX_M8_R8.html) correctly, my version has 8 uops per input vector, which can distribute over `p01234567` (i.e., it indeed leaves no room for loop-control). – chtz Nov 07 '19 at 16:30
  • @chtz Hello guys. Some news: I tested your proposed code. Maybe I did something wrong but it seems that there is a mismatch between the returned _mm_cvtsi128_si32(sum_reduced) value and the populated bitVector32. In a test I performed your code produced a bit vector with 16 missing 1's. I.e. _mm_cvtsi128_si32(sum_reduced) = 323056, while the bitvector32 only contained 323044 1's. From the scalar C code I can see that the returned sum is correct, while the bit vector is missing 16 1's. Any idea? – Liotro78 Nov 08 '19 at 17:18
  • Typo: there are 12 missing 1s. – Liotro78 Nov 08 '19 at 17:30
  • @Liotro78 did you handle the remainder of the loop? (I said at the beginning of the answer that you need to do that manually) -- this would only explain up to 7 missing values, though. Can you provide a reduced testcase? – chtz Nov 08 '19 at 17:38
  • Hi, the numbers I showed you are based on a case where dim is a multiple of 8. Anyway, I will try to provide you with a test case. – Liotro78 Nov 08 '19 at 17:49
  • @chtz One update: I realized only now that in my application I actually need big-endian order for the bit vector. Maybe this explains the different results. How should your code be modified to work with big-endian bit ordering? – Liotro78 Nov 08 '19 at 19:39
  • @Liotro78 I added a big-endian variant (untested). Can only work if dim is a multiple of 32. – chtz Nov 11 '19 at 01:21