We need to compute the bitwise XOR of two arrays, each containing 5 elements of type uint64_t (unsigned long long), and then count the 1 bits in the result (population count). What is the most efficient way to do this in the fewest clock cycles using the 256-bit YMM registers, the AVX2 VPXOR instruction, and popcount?
Right now we are doing this with the following code snippet:
    for (j = 0; j < 5; j++) {
        /* XOR one 64-bit word from each array, then count the set bits */
        xorResult = cylinderArrayVectorA[j] ^ cylinderArrayVectorB[j];
        noOfOnes = _mm_popcnt_u64(xorResult);
        sumOfOnes += noOfOnes;
    }
We have 260 bits in array A and 260 bits in array B. What is the fastest way to perform the AVX2 VPXOR and the popcount on them in the fewest clock cycles?
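For concreteness, here is a minimal sketch of one way the VPXOR plus popcount combination being asked about could look. It is an illustration, not a measured optimum. AVX2 has no vector popcount instruction, so the sketch XORs the first four words with a single VPXOR, extracts the four 64-bit lanes, counts them with scalar POPCNT, and handles the fifth word in scalar code. The function name `xor_popcount_5x64` and the macro `XOR_POPCNT_TARGET` are hypothetical names introduced here. The sketch assumes an x86-64 target and GCC or Clang for the `target` attribute; with other compilers, enable AVX2 and POPCNT through compiler options instead.

```c
#include <stdint.h>
#include <immintrin.h>

/* Assumption: GCC or Clang. The attribute enables AVX2 and POPCNT for this
   one function; other compilers need the equivalent command-line options. */
#if defined(__GNUC__) || defined(__clang__)
#define XOR_POPCNT_TARGET __attribute__((target("avx2,popcnt")))
#else
#define XOR_POPCNT_TARGET
#endif

/* Hypothetical helper: returns the number of 1 bits in a[i] ^ b[i], i = 0..4. */
XOR_POPCNT_TARGET
static uint64_t xor_popcount_5x64(const uint64_t *a, const uint64_t *b)
{
    /* Words 0..3: one unaligned 256-bit load per array, then one VPXOR. */
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i vx = _mm256_xor_si256(va, vb);

    /* No vector popcount in AVX2: extract each 64-bit lane and use POPCNT. */
    uint64_t sum = 0;
    sum += (uint64_t)_mm_popcnt_u64((uint64_t)_mm256_extract_epi64(vx, 0));
    sum += (uint64_t)_mm_popcnt_u64((uint64_t)_mm256_extract_epi64(vx, 1));
    sum += (uint64_t)_mm_popcnt_u64((uint64_t)_mm256_extract_epi64(vx, 2));
    sum += (uint64_t)_mm_popcnt_u64((uint64_t)_mm256_extract_epi64(vx, 3));

    /* Word 4 does not fill a YMM register, so it stays scalar. */
    sum += (uint64_t)_mm_popcnt_u64(a[4] ^ b[4]);
    return sum;
}
```

A call would look like `sumOfOnes = xor_popcount_5x64(cylinderArrayVectorA, cylinderArrayVectorB);`. The lane extraction adds work that the plain scalar loop does not have, so whether this version is actually faster for only five words would have to be measured on the target CPU.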