
We need to XOR two arrays, each holding 5 elements of uint64_t (unsigned long long), and then count the set bits (popcount) of the result. What is the best way to do this with AVX2, using the 256-bit YMM registers, VPXOR, and popcount, in the fewest clock cycles?

Right now we do this with the following code:

for (j = 0; j < 5; j++) {
    xorResult = cylinderArrayVectorA[j] ^ cylinderArrayVectorB[j];
    noOfOnes = _mm_popcnt_u64(xorResult);
    sumOfOnes += noOfOnes;
}

We have 260 bits in array A and in array B. How should AVX2 VPXOR and popcount be combined to process them in the fewest clock cycles?

Peter Cordes
  • 2
    Is this time-critical? If so, where does the data come from? Are you able to interleave this with some other calculation? Are you sure about 260 bits (i.e., 256+4 bits?) – chtz Mar 27 '19 at 10:26
  • 1
    A good compiler will vectorize this loop for you. See e.g. [this example for clang 8.0](https://godbolt.org/z/XLMhlC). Interestingly [gcc 8.3 just unrolls the loop and doesn't vectorize](https://godbolt.org/z/nMlEwD). Try benchmarking each one to see which is faster. – Paul R Mar 27 '19 at 10:30
  • 2
    @PaulR: clang knows that vectorizing popcnt with a `pshufb` LUT is a win with AVX2 (for large arrays); gcc doesn't have that recipe / pattern built-in so it's always going to miss that optimization. For exactly 5 elements, the horizontal-sum overhead might outweigh scalar 5x load/xor reg,mem/popcnt and 4x scalar add. But maybe not if the vector constants are used repeatedly for multiple 5-element arrays. Clang is using scalar popcnt for the 5th element, so that's good. (But it would make more sense to guess the array might be 32B-aligned and do scalar for the last element, not first.) – Peter Cordes Mar 27 '19 at 10:44
  • 3
    And BTW, AVX2 SIMD popcnt isn't a win on Ryzen (for large arrays), IIRC, only on Intel. Ryzen has 4-per-clock throughput for scalar popcnt. Anyway, the best choice here may depend on the surrounding code. This is too short for a metric like "clock cycles" to be sufficient; the surrounding code might bottleneck on the front-end, on the latency of this (with recently-written arrays), or on back-end throughput (port 5 vs. port 1 bottlenecks?) See [static perf analysis](//stackoverflow.com/q/51607391) – Peter Cordes Mar 27 '19 at 10:53
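For reference, the approach discussed in the comments (one 256-bit VPXOR over the first four elements, a `pshufb` nibble-LUT popcount, and scalar `popcnt` for the fifth element) might look roughly like the sketch below. This is only an illustration, not a measured recommendation: the function names `xor_popcount5_scalar` and `xor_popcount5_avx2` are made up here, the code assumes x86-64 with GCC or clang (it uses the `target` function attribute instead of `-mavx2 -mpopcnt`), and it assumes any unused high bits of the fifth element are zero in both arrays. Whether it beats the scalar loop for just 5 elements depends on the surrounding code, as noted above.

```c
#include <stdint.h>
#include <immintrin.h>

/* Scalar reference: same work as the loop in the question. */
__attribute__((target("popcnt")))
static uint64_t xor_popcount5_scalar(const uint64_t *a, const uint64_t *b)
{
    uint64_t sum = 0;
    for (int j = 0; j < 5; j++)
        sum += (uint64_t)_mm_popcnt_u64(a[j] ^ b[j]);
    return sum;
}

/* Hypothetical AVX2 variant: elements 0..3 in one YMM register,
 * element 4 handled with scalar popcnt. */
__attribute__((target("avx2,popcnt")))
static uint64_t xor_popcount5_avx2(const uint64_t *a, const uint64_t *b)
{
    /* Popcount of every 4-bit value, repeated in both 128-bit lanes
     * because vpshufb shuffles within each lane. */
    const __m256i lut = _mm256_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    const __m256i low_mask = _mm256_set1_epi8(0x0f);

    /* Unaligned loads, since alignment of the arrays is not known. */
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i x  = _mm256_xor_si256(va, vb);            /* vpxor */

    /* Split each byte into low and high nibble, look up both counts. */
    __m256i lo  = _mm256_and_si256(x, low_mask);
    __m256i hi  = _mm256_and_si256(_mm256_srli_epi16(x, 4), low_mask);
    __m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                  _mm256_shuffle_epi8(lut, hi));

    /* vpsadbw against zero sums the 8 byte counts of each 64-bit element. */
    __m256i sums = _mm256_sad_epu8(cnt, _mm256_setzero_si256());

    /* Horizontal sum of the four 64-bit partial sums. */
    __m128i s = _mm_add_epi64(_mm256_castsi256_si128(sums),
                              _mm256_extracti128_si256(sums, 1));
    s = _mm_add_epi64(s, _mm_unpackhi_epi64(s, s));
    uint64_t total = (uint64_t)_mm_cvtsi128_si64(s);

    /* Fifth element: scalar xor + popcnt. */
    total += (uint64_t)_mm_popcnt_u64(a[4] ^ b[4]);
    return total;
}
```

If many 5-element pairs are processed in a loop, the two vector constants stay in registers and the horizontal sum can be deferred by accumulating the `vpsadbw` results across iterations, which is where the vector version is more likely to pay off. Benchmark both variants in the real calling code before picking one.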

0 Answers