
I’ve looked at Murmur3 and Meow, but they both seem to be optimized for bandwidth when hashing long arrays. I don’t have any arrays; I only have uint32_t integers on input. My inputs are small non-negative numbers, usually under a few million, all divisible by 3. The probability density is uniform over the range [ 0 .. N*3 ] for some integer N.

Is the following code good enough in terms of performance and quality of distribution?

// Count of bits in the complete Bloom filter
// Using 32 KB of memory to hopefully stay in the L1D cache
constexpr size_t countBits = 1024 * 32 * 8;

// These bytes are from random.org
// Cryptographic strength is irrelevant, we only need performance here.
static const __m128i seed = _mm_setr_epi8( (char)0x8a, (char)0xf4, 0x30, (char)0x91, (char)0x07, 0x05, 0x45, (char)0x99, 0x2f, (char)0x95, 0x4a, (char)0xa2, (char)0x84, (char)0x88, (char)0xe6, (char)0x09 );

// Mask to extract lower bits from the hash
static const __m128i andMask = _mm_set1_epi32( countBits - 1 );

// Compute a 16-byte hash of the key and mask the upper bits in every 32-bit lane, producing 4 bit positions for the Bloom filter.
inline __m128i hash( uint32_t key )
{
    // Broadcast integer to 32-bit lanes
    __m128i a = _mm_set1_epi32( (int)key );
    // Abuse AES-NI decrypting instruction to make a 16-byte hash.
    // That thing only takes 4 cycles of latency to complete.
    a = _mm_aesdec_si128( a, seed );
    // Mask the upper bits in the lanes so that _mm_extract_epi32 returns positions of bits in the Bloom filter to set or test
    a = _mm_and_si128( a, andMask );
    return a;
}
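
For reference, the four lanes are consumed roughly like this to set and test bits in the filter (a sketch only: the bloomBytes array and the helper names are illustrative rather than the exact production code, and the lanes are stored to memory instead of four separate _mm_extract_epi32 calls just for brevity):

#include <array>
#include <cstdint>

// Illustrative storage for the filter: countBits / 8 = 32 KB of bytes
static std::array<uint8_t, countBits / 8> bloomBytes = {};

inline void bloomAdd( uint32_t key )
{
    alignas( 16 ) uint32_t pos[ 4 ];
    _mm_store_si128( (__m128i*)pos, hash( key ) );
    for( uint32_t p : pos )
        bloomBytes[ p / 8 ] |= (uint8_t)( 1u << ( p % 8 ) );
}

inline bool bloomMayContain( uint32_t key )
{
    alignas( 16 ) uint32_t pos[ 4 ];
    _mm_store_si128( (__m128i*)pos, hash( key ) );
    for( uint32_t p : pos )
        if( 0 == ( bloomBytes[ p / 8 ] & ( 1u << ( p % 8 ) ) ) )
            return false;   // definitely not in the set
    return true;            // possibly in the set, false positives are possible
}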

Update: the question is not opinion-based because the count of hash collisions is trivially measurable. It’s for this very reason that I’ve specified the distribution of my input values.

Soonts
  • Opinion based? Count of hash collisions is trivially measurable. – Soonts Feb 13 '22 at 17:18
  • Define _"good enough"_ precisely if you don't want to end up with opinionated answers here, please. What are your actual requirements? If the measurement data you want to base potential answers on can be gathered trivially, why didn't you add your current measurements? – πάντα ῥεῖ Feb 13 '22 at 17:25
  • @πάνταῥεῖ Minimizing the count of hash collisions with my input values. And not wasting too much CPU time doing so, because my Bloom filter doesn’t save disk I/O; it’s there to save CPU time spent intersecting large hash sets of numbers. – Soonts Feb 13 '22 at 17:28
  • Is your code "good enough" to achieve _what_, exactly? – Drew Dormann Feb 13 '22 at 17:35
  • @DrewDormann Yeah, it appears to work well. For one thing, I might be missing something obvious; I’m not an expert in cryptography or hash functions. For another, it might be possible to improve. For instance, I’m not sure at all that I need the initial broadcast instruction; the aesdec might produce good enough hashes with the integer just in the lowest lane of the vector (see the sketch after these comments). – Soonts Feb 13 '22 at 17:38
  • You may be interested in posting to https://codereview.stackexchange.com/ if I'm correctly understanding your question to be along the lines of "This code works, but could it be improved?" – Drew Dormann Feb 13 '22 at 18:19
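
Regarding the last comments, here is a sketch of the variant without the broadcast, reusing the seed and andMask constants from the question (hashNoBroadcast is a hypothetical name; whether its distribution is acceptable would still have to be measured):

// Variant without the initial broadcast: the key goes into lane 0 only.
// Caution: a single aesdec spreads the four input bytes into four different
// columns, so each 32-bit lane of the result ends up depending on only one
// byte of the key, which likely makes the distribution noticeably worse.
inline __m128i hashNoBroadcast( uint32_t key )
{
    // Load the key into the lowest 32-bit lane, zero the other lanes
    __m128i a = _mm_cvtsi32_si128( (int)key );
    a = _mm_aesdec_si128( a, seed );
    return _mm_and_si128( a, andMask );
}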

1 Answer


It sounds like you might not need to use a hash function at all:

My inputs are small non-negative numbers, usually under a few million, all divisible by 3. The probability density is uniform over the range [ 0 .. N*3 ] for some integer N.

The whole point of using a hash function for a Bloom filter is to take an input whose probability density is not uniform and make it uniform. Of course, even if statistically every value is equally likely to occur, it can still be that a given input is correlated with a previous input (say, values close together are likely to appear in groups), in which case you might still need to apply a hash function.
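
For example, a minimal sketch of that idea, assuming one bit per possible input value is affordable (the type and member names here are illustrative): with the keys uniform over [ 0 .. N*3 ] and divisible by 3, key / 3 can index a plain bit set directly, giving an exact membership test with no hash function and no false positives.

#include <cstdint>
#include <vector>

// Sketch only: exact membership test for keys in [ 0 .. maxKey ], all divisible by 3
struct ExactBitSet
{
    std::vector<uint64_t> words;

    explicit ExactBitSet( uint32_t maxKey )
        : words( maxKey / 3 / 64 + 1, 0 ) { }

    void add( uint32_t key )
    {
        const uint32_t i = key / 3;
        words[ i / 64 ] |= 1ull << ( i % 64 );
    }

    bool contains( uint32_t key ) const
    {
        const uint32_t i = key / 3;
        return 0 != ( words[ i / 64 ] & ( 1ull << ( i % 64 ) ) );
    }
};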

Is the following code good enough in terms of performance and quality of distribution?

That is hard to answer; only you can tell whether it meets the requirements of your application. A single round of AES should not be considered cryptographically secure, and even if it were, using a known fixed seed would allow an attacker who can control the inputs to your Bloom filter to distribute them in such a way that the filter becomes very inefficient. But you said it yourself:

[...] count of hash collisions is trivially measurable.

You can measure the distribution of hash values before and after applying the AES decryption round on realistic input, and thus check if it is good enough for you.
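
As a starting point, such a measurement could look like the sketch below. It assumes the hash(), seed, andMask and countBits definitions from the question are in scope; the key range and sample count are arbitrary placeholders. It histograms the four bit positions produced for uniformly random multiples of 3 and prints how far the most and least loaded positions deviate from the expected average.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>
#include <immintrin.h>

int main()
{
    std::mt19937 rng{ 42 };
    // Keys are uniform multiples of 3; N = 2 million here, adjust to taste
    std::uniform_int_distribution<uint32_t> dist{ 0, 2'000'000 };

    std::vector<uint32_t> histogram( countBits, 0 );
    constexpr size_t keys = 1'000'000;
    for( size_t i = 0; i < keys; i++ )
    {
        const uint32_t key = dist( rng ) * 3;
        alignas( 16 ) uint32_t pos[ 4 ];
        _mm_store_si128( (__m128i*)pos, hash( key ) );
        for( uint32_t p : pos )
            histogram[ p ]++;
    }

    uint32_t lo = UINT32_MAX, hi = 0;
    for( uint32_t c : histogram )
    {
        lo = std::min( lo, c );
        hi = std::max( hi, c );
    }
    const double expected = double( keys ) * 4.0 / double( countBits );
    printf( "Expected %.2f hits per bit position, observed min %u, max %u\n", expected, lo, hi );
    return 0;
}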

G. Sliepen
  • Thanks for the answer, but I don’t understand one thing. Let’s say N = 2 million (in reality it’s arbitrary and depends on the end user’s choices). If I divide my keys by 3, that gives about 21 bits of data per key. However, for my Bloom filter with 32 kilobytes of data in the filter and 4 hash functions, I need 72 bits of hashes in total: 18 bits for each hash function (2^18 bits = 32 kB). – Soonts Feb 13 '22 at 18:35
  • With that example I would say: why not extend your Bloom filter to 81.4 kB? Then you have one bit per possible input value, so you get a perfect, error-free hash just by using the key divided by 3. – G. Sliepen Feb 13 '22 at 18:55
  • I’m not sure how to do that, because N depends on the user’s choices. Specifically, N is the count of triangles in a high-poly 3D model, not necessarily a power of 2. BTW, for small N I don’t need any of that complexity; `std::unordered_set` is fast enough for computing intersections of my 3 sets. The Bloom filter is an optimization for large N. – Soonts Feb 13 '22 at 19:15
  • Hm, now it is beginning to sound like an [XY problem](https://en.wikipedia.org/wiki/XY_problem). Maybe it would help if you created a new question on StackOverflow where you describe the whole problem you are having (finding intersections/collisions of 3D models?), what algorithm you use to find those intersections, and how to speed that up. A Bloom filter might not be the best solution. – G. Sliepen Feb 13 '22 at 20:45
  • I already know how to solve the complete problem. A CPU sampling profiler took me to the intersection of `uint32_t` hash sets, and the Bloom filter helped to optimize it. I’m just not sure about these particular implementation details. Most articles on the internet use Bloom filters (a) for large values like strings, i.e. with hash functions optimized for bandwidth rather than latency, and (b) to save disk I/O or network requests; in those use cases it’s a performance win regardless of the latency of the hash function, since I/O is way more expensive anyway. – Soonts Feb 13 '22 at 21:03
  • Why are the hash sets too slow? Cache misses? Maybe the Bloom filter will help if it's in the L1 cache; is that why you chose 32 kB for its size? But if you already profiled your code, then you should be able to calculate a budget for how many cycles you can spend on the Bloom filter before its overhead negates its benefits. – G. Sliepen Feb 13 '22 at 22:35
  • The hash sets were slow simply because there was too much data for them. I have many of these queries to run, and the time affects UX. Adding 40k random numbers into a hash set on every query was simply too slow for high-poly meshes. Each of the 3 sets being intersected is relatively large, but their intersection is very small. The 3 sets are the results of range queries on XYZ coordinates, using two binary searches over a single std::vector with a special index that can find both lower and upper bounds. – Soonts Feb 13 '22 at 23:55
  • Even though the false positive rate of the Bloom filter can approach 98% in my setup, the filter still reduced the size of the hash map from 20-40k entries to something very reasonable, like 1000-2000. At that smaller scale, hash map operations became pretty much instant. Testing 40k items with the Bloom filter is very fast too; most of the elements being tested aren’t there, so the main branch is very predictable. – Soonts Feb 13 '22 at 23:55