For my open source project cachegrand we are implementing AArch64 support, and although most of the port is complete, we are still working on a feature that performs an accelerated array search using NEON instructions.
The logic we use is pretty simple (a scalar sketch of it follows the list):
- the input is an array of 14 uint32 elements, the value to find, and a mask used to ignore certain matches
- the code has to find every element that matches the given uint32
- a bitmask is built from the comparison results
- the least significant bits of the bitmask correspond to the beginning of the array
- the bitmask is then ANDed with the inverse of the skip indexes mask, so the matches that must be ignored are cleared
- the trailing zeros are counted to determine the index of the first occurrence
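For clarity, this is roughly the scalar equivalent of what the vectorized versions below are meant to compute; it is only an illustrative sketch and the function name is made up:

#include <stdint.h>

static inline uint8_t hash_search_scalar_14_reference(
        uint32_t hash,
        volatile uint32_t* hashes,
        uint32_t skip_indexes_mask) {
    uint32_t compacted_result_mask = 0;

    // Bit i of the bitmask is set when hashes[i] matches the searched hash
    for (uint8_t index = 0; index < 14; index++) {
        if (hashes[index] == hash) {
            compacted_result_mask |= 1u << index;
        }
    }

    // Clear the matches that have to be skipped, then return the index of the
    // first remaining one (like the vectorized versions, this assumes at least
    // one match survives the skip mask)
    return (uint8_t)__builtin_ctz(compacted_result_mask & ~skip_indexes_mask);
}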
It's very rare that the skip indexes mask is actually used; I would say that in 99.9% of the cases it will be zero (as an example, a mask of 0b1000 would cause a match at index 3 to be ignored).
I have come up with the following implementation, but I have no experience with ARMv8 NEON instructions and it feels a bit clunky, so I was wondering if there is a way to make it faster and/or better.
For reference, currently the code is compiled only with GCC.
#include <arm_neon.h>

uint8_t hashtable_mcmp_support_hash_search_armv8a_neon_14(
        uint32_t hash,
        volatile uint32_t* hashes,
        uint32_t skip_indexes_mask) {
    uint32x4_t tmp;
    uint32_t compacted_result_mask = 0;
    uint32_t skip_indexes_mask_inv = ~skip_indexes_mask;
    // Per-lane shift amounts used to move each comparison result into bits 0..3
    static const int32x4_t shift = {0, 1, 2, 3};
    uint32x4_t cmp_vector = vdupq_n_u32(hash);

    // Elements 0..3: compare, reduce each 0xFFFFFFFF/0x0 lane to a single bit, OR it into the mask
    // (hashtable_hash_half_t is the project's uint32 half-hash type)
    uint32x4_t ring_vector_0_3 = vld1q_u32((hashtable_hash_half_t*)hashes + 0);
    uint32x4_t cmp_vector_0_3 = vceqq_u32(ring_vector_0_3, cmp_vector);
    tmp = vshrq_n_u32(cmp_vector_0_3, 31);
    compacted_result_mask |= vaddvq_u32(vshlq_u32(tmp, shift)) << 0;

    // Elements 4..7
    uint32x4_t ring_vector_4_7 = vld1q_u32((hashtable_hash_half_t*)hashes + 4);
    uint32x4_t cmp_vector_4_7 = vceqq_u32(ring_vector_4_7, cmp_vector);
    tmp = vshrq_n_u32(cmp_vector_4_7, 31);
    compacted_result_mask |= vaddvq_u32(vshlq_u32(tmp, shift)) << 4;

    // Elements 8..11
    uint32x4_t ring_vector_8_11 = vld1q_u32((hashtable_hash_half_t*)hashes + 8);
    uint32x4_t cmp_vector_8_11 = vceqq_u32(ring_vector_8_11, cmp_vector);
    tmp = vshrq_n_u32(cmp_vector_8_11, 31);
    compacted_result_mask |= vaddvq_u32(vshlq_u32(tmp, shift)) << 8;

    // Elements 10..13: this load overlaps indexes 10 and 11 because only 14 elements are available
    uint32x4_t ring_vector_10_13 = vld1q_u32((hashtable_hash_half_t*)hashes + 10);
    uint32x4_t cmp_vector_10_13 = vceqq_u32(ring_vector_10_13, cmp_vector);
    tmp = vshrq_n_u32(cmp_vector_10_13, 31);
    compacted_result_mask |= vaddvq_u32(vshlq_u32(tmp, shift)) << 10;

    return __builtin_ctz(compacted_result_mask & skip_indexes_mask_inv);
}
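To make the expected behaviour concrete, here is a minimal usage sketch of the NEON function above; the hash values in it are made up purely for illustration, and the AVX2 variant below is expected to behave the same way for the same inputs:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    // 14 half hashes, with the searched value 0xCAFE stored at indexes 3 and 7
    volatile uint32_t hashes[14] = {
        0x1111, 0x2222, 0x3333, 0xCAFE, 0x4444, 0x5555, 0x6666, 0xCAFE,
        0x7777, 0x8888, 0x9999, 0xAAAA, 0xBBBB, 0xCCCC
    };

    // No skip mask: the first match wins, so this prints 3
    printf("%d\n", (int)hashtable_mcmp_support_hash_search_armv8a_neon_14(0xCAFE, hashes, 0));

    // Skipping index 3 (bit 3 set): the next match wins, so this prints 7
    printf("%d\n", (int)hashtable_mcmp_support_hash_search_armv8a_neon_14(0xCAFE, hashes, 1u << 3));

    return 0;
}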
Just for reference, here is the AVX2 code:
#include <immintrin.h>

static inline uint8_t hashtable_mcmp_support_hash_search_avx2_14(
        uint32_t hash,
        volatile uint32_t* hashes,
        uint32_t skip_indexes_mask) {
    uint32_t compacted_result_mask = 0;
    uint32_t skip_indexes_mask_inv = ~skip_indexes_mask;
    __m256i cmp_vector = _mm256_set1_epi32(hash);

    // Two overlapping loads: the first covers elements 0..7, the second elements 6..13,
    // since _mm256_loadu_si256 always loads 8 x uint32
    for(uint8_t base_index = 0; base_index < 12; base_index += 6) {
        __m256i ring_vector = _mm256_loadu_si256((__m256i*) (hashes + base_index));
        __m256i result_mask_vector = _mm256_cmpeq_epi32(ring_vector, cmp_vector);

        // _mm256_movemask_ps compacts the 8 comparison lanes into an 8-bit mask
        compacted_result_mask |= (uint32_t)_mm256_movemask_ps(_mm256_castsi256_ps(result_mask_vector)) << (base_index);
    }

    return _tzcnt_u32(compacted_result_mask & skip_indexes_mask_inv);
}
As a side question, do you think it's worth implementing support for SVE2 instructions, especially considering that this is a pretty simple operation and that support for 256-bit registers does not appear to be mandatory (which would probably be the biggest benefit of using SVE2 in this specific context)?