I’ve looked at Murmur3 and Meow, but they both seem to be optimized for bandwidth when hashing long arrays. I don’t have any arrays; I only have uint32_t integers on input. My inputs are small non-negative numbers, usually under a few million, all divisible by 3. The probability density is uniform over the range [ 0 .. 3*N ] for some integer N.
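For concreteness, test keys matching this distribution can be generated as below; the default N and the makeTestKeys name are placeholders for illustration, not part of my actual workload:

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Generate `count` test keys matching the distribution above:
// uniform multiples of 3 in [ 0 .. 3*N ]. The default N is a placeholder.
std::vector<uint32_t> makeTestKeys( size_t count, uint32_t N = 1'000'000 )
{
    // Fixed seed so collision measurements are reproducible
    std::mt19937 rng{ 42 };
    std::uniform_int_distribution<uint32_t> dist( 0, N );
    std::vector<uint32_t> keys( count );
    for( uint32_t& k : keys )
        k = dist( rng ) * 3; // every key is divisible by 3
    return keys;
}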
Is the following code good enough in terms of both performance and quality of distribution?
#include <immintrin.h> // SSE2 and AES-NI intrinsics
#include <cstdint>
#include <cstddef>

// Count of bits in the complete Bloom filter.
// Using 32 KB of memory to hopefully stay in the L1D cache.
constexpr size_t countBits = 1024 * 32 * 8;
// These bytes are from random.org
// Cryptographic strength is irrelevant; we only need performance here.
static const __m128i seed = _mm_setr_epi8( (char)0x8a, (char)0xf4, 0x30, (char)0x91, (char)0x07, 0x05, 0x45, (char)0x99, 0x2f, (char)0x95, 0x4a, (char)0xa2, (char)0x84, (char)0x88, (char)0xe6, (char)0x09 );
// Mask to extract lower bits from the hash
static const __m128i andMask = _mm_set1_epi32( countBits - 1 );
// Compute a 16-byte hash, then mask the upper bits in every 32-bit lane, producing 4 bit positions for the Bloom filter.
inline __m128i hash( uint32_t key )
{
// Broadcast integer to 32-bit lanes
__m128i a = _mm_set1_epi32( (int)key );
// Abuse the AES-NI decryption round instruction to make a 16-byte hash.
// That instruction only takes 4 cycles of latency to complete.
a = _mm_aesdec_si128( a, seed );
// Mask the upper bits in each lane, so _mm_extract_epi32 returns positions of bits in the Bloom filter to set or test
a = _mm_and_si128( a, andMask );
return a;
}
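For context, here is roughly how the four lanes get consumed; the bloom array and the bloomAdd helper below are illustrative only, not my exact code:

// Illustrative only: set the 4 bits addressed by hash( key ).
// `bloom` is assumed to point at countBits / 8 zero-initialized bytes.
inline void bloomAdd( uint8_t* bloom, uint32_t key )
{
    const __m128i h = hash( key );
    const uint32_t p0 = (uint32_t)_mm_extract_epi32( h, 0 );
    const uint32_t p1 = (uint32_t)_mm_extract_epi32( h, 1 );
    const uint32_t p2 = (uint32_t)_mm_extract_epi32( h, 2 );
    const uint32_t p3 = (uint32_t)_mm_extract_epi32( h, 3 );
    bloom[ p0 / 8 ] |= (uint8_t)( 1u << ( p0 % 8 ) );
    bloom[ p1 / 8 ] |= (uint8_t)( 1u << ( p1 % 8 ) );
    bloom[ p2 / 8 ] |= (uint8_t)( 1u << ( p2 % 8 ) );
    bloom[ p3 / 8 ] |= (uint8_t)( 1u << ( p3 % 8 ) );
}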
Update: the question is not opinion-based, because the count of hash collisions is trivially measurable. It’s for this very reason that I’ve specified the distribution of my input values; the kind of test I have in mind is sketched below.
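A minimal sketch of such a measurement, reusing the makeTestKeys placeholder from above: it hashes a million keys and reports how evenly the resulting 4 million bit positions spread over the filter.

#include <algorithm>
#include <cstdio>

// Sketch: count how often each filter bit position is produced,
// then report the spread. A flat histogram means a good distribution.
void measureDistribution()
{
    std::vector<uint32_t> counts( countBits, 0 );
    for( uint32_t key : makeTestKeys( 1'000'000 ) )
    {
        const __m128i h = hash( key );
        counts[ (uint32_t)_mm_extract_epi32( h, 0 ) ]++;
        counts[ (uint32_t)_mm_extract_epi32( h, 1 ) ]++;
        counts[ (uint32_t)_mm_extract_epi32( h, 2 ) ]++;
        counts[ (uint32_t)_mm_extract_epi32( h, 3 ) ]++;
    }
    const uint32_t lo = *std::min_element( counts.begin(), counts.end() );
    const uint32_t hi = *std::max_element( counts.begin(), counts.end() );
    printf( "min %u, max %u hits per bit; ideal average %.2f\n",
        lo, hi, 4.0 * 1'000'000 / (double)countBits );
}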