Very fast hash table lookup in C (e.g. by MPH)

Question

I need a very fast hash table in C (or C++). Conditions are like this:

There exist N known keys which shall map to an object (with some state)
There exist more unknown keys which do not map to anything

Because all keys which map to an object are known at startup (but not at compile-time), it's okay if building the hashtable is expensive. However, it's required that lookup is (very) fast.

I thought about using cmph (for a minimal perfect hash function). The hash table would be built with the N keys, and at runtime I would do a query like this:

const cmph_uint32 id = cmph_search(hash, (const char *)&key, (cmph_uint32)sizeof(key));

if (id >= N) {
    return; // object not found: index out of range
}

const MyState *state = &states[id];
if (state->key != key) {
    return; // object not found: slot belongs to a different key
}

// object found

By storing the actual key in the state, it should be possible to detect whether an unknown key has collided with a known one. However, I'm not sure whether calling cmph_search with an "unknown" key is undefined behavior (e.g. weird memory access or something).

Maybe someone has a better idea? Or maybe someone knows whether calling cmph_search with an unknown key is fine?

does `https://www.gnu.org/software/gperf/manual/gperf.html` help you? — tstanisl, Feb 16 '23 at 12:15
I think gperf generates C-code, so the keys must be known at compile-time. Unfortunately, I only know the keys at runtime (but at the startup). — Kevin Meier, Feb 16 '23 at 12:27
"very fast hash table in ... or *C++*" -> it's easy to google for speed comparisons e.g. https://martin.ankerl.com/2022/08/27/hashmap-bench-01/ - with modern SIMD techniques, they'll be considerably faster with most key sets than minimal perfect hashing (which typically requires more work - e.g. dependency on tables of support data that have to be loaded from cache/RAM - in the hash function itself). With SIMD a bunch of buckets can be searched in parallel, so up to a certain level of collisions don't impact lookup performance, but be wary of identity hashes with power-of-2 bucket counts... — Tony Delroy, Feb 22 '23 at 23:44

score 1 · Accepted Answer · answered Feb 16 '23 at 13:52

Hard to tell just by looking at the (pretty much non-existent) documentation of CMPH. Digging into the source code seems simpler. The implementation of the internal hash function used by CMPH can be found in the hash() function, which ends up calling __jenkins_hash_vector(). This hash function was originally designed by Robert J. Jenkins Jr. in 1997 and can be found here. Nothing weird happens with the key inside this function, so it can safely be used even for invalid (non-present) keys.

The cmph_search() function calls the correct *_search() function based on the algorithm you configured (CHD, BDZ, BMZ, and so on). Then hash() is called and the resulting values are used in different ways depending on the algorithm.

For simpler algorithms such as BMZ, BMZ8 and FCH I can see that the hash is simply used to index an internal array (mphf->data->g). All the accesses are performed modulo the size of this array (mphf->data->n), so this looks fine. Just from this, I would say you are safe if you are using these algorithms. For more complex algorithms like BDZ it's a bit harder to understand what is really going on and where and how the calculated hashes are actually used.
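To illustrate why modulo-reduced indexing is safe for any input, here is a minimal sketch. It is not CMPH code: toy_hash() (a seeded FNV-1a) and toy_mph_search() are hypothetical stand-ins, and the two-probe shape is only loosely modeled on the description above. The point is that every index into g is reduced modulo n first, so an unknown key can produce a meaningless result but never an out-of-bounds read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in hash (seeded FNV-1a). CMPH itself uses a Jenkins hash;
 * this is only here to keep the sketch self-contained. */
static uint32_t toy_hash(const void *key, size_t len, uint32_t seed)
{
    const unsigned char *p = (const unsigned char *)key;
    uint32_t h = 2166136261u ^ seed;
    for (size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Hypothetical two-probe lookup: both indices are reduced modulo n
 * before g is touched, so any key bytes, known or unknown, stay
 * inside the array. The returned value is only meaningful for keys
 * the table was built with. */
static uint32_t toy_mph_search(const uint8_t *g, uint32_t n,
                               const void *key, size_t len)
{
    assert(n > 0);
    uint32_t h1 = toy_hash(key, len, 1u) % n;
    uint32_t h2 = toy_hash(key, len, 2u) % n;
    return (uint32_t)g[h1] + (uint32_t)g[h2];
}
```

Whatever bytes are passed in, the function only ever reads g[0] through g[n-1]; deciding whether the result belongs to a known key is left to the caller.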

Taking a look at the tests implemented in the library source (for example, this one), we can see that the author uses logic similar to yours to detect whether a key is a duplicate or unknown:

    /* ... */

    cmph_uint32 siz = cmph_size(mphf);
    hashtable = (cmph_uint8*)malloc(siz*sizeof(cmph_uint8));
    memset(hashtable, 0, (size_t)siz);
    //check all keys
    for (i = 0; i < source->nkeys; ++i)
    {
        cmph_uint32 h;
        char *buf;
        cmph_uint32 buflen = 0;
        source->read(source->data, &buf, &buflen);
        h = cmph_search(mphf, buf, buflen);
        if (!(h < siz))
        {
            fprintf(stderr, "Unknown key %*s in the input.\n", buflen, buf);
            ret = 1;
        } else if(hashtable[h])
        {
            fprintf(stderr, "Duplicated or unknown key %*s in the input\n", buflen, buf);
            ret = 1;
        } else hashtable[h] = 1;

        if (verbosity)
        {
            printf("%s -> %u\n", buf, h);
        }
        source->dispose(source->data, buf, buflen);
    }

    /* ... */

The only thing that's missing from the above code is storing and comparing the keys like you are doing in your example. At the end of the day, it looks to me like calling cmph_search() with an unknown key is fine. Understanding whether the key is unknown or not given the resulting hash is then the job of the library user.
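The store-and-compare step can be sketched in a self-contained way as follows. This does not use CMPH: toy_search() is a hypothetical stand-in for cmph_search() that is made collision-free on the known keys by brute-forcing a seed, and MyState, N_KEYS, build_table() and lookup() are illustrative names. With the real library, only the body of toy_search() would change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_KEYS 4u

typedef struct {
    uint64_t key;    /* the known key that owns this slot */
    int      value;  /* placeholder for the real per-object state */
} MyState;

/* Stand-in for cmph_search(): maps ANY key to some index. */
static uint32_t toy_search(uint64_t key, uint64_t seed)
{
    uint64_t h = (key ^ seed) * 0x9E3779B97F4A7C15ull;
    h ^= h >> 32;
    return (uint32_t)(h % N_KEYS);
}

/* Brute-force a seed that gives every known key its own slot, then
 * store each key in its slot. Returns 0 on success, -1 if no seed
 * was found within the (arbitrary) try limit. */
static int build_table(const uint64_t keys[N_KEYS], MyState table[N_KEYS],
                       uint64_t *seed_out)
{
    for (uint64_t seed = 0; seed < 1000000u; ++seed) {
        unsigned used = 0;  /* bitmask of occupied slots */
        size_t i;
        for (i = 0; i < N_KEYS; ++i) {
            uint32_t id = toy_search(keys[i], seed);
            assert(id < N_KEYS);
            if (used & (1u << id)) break;  /* collision: next seed */
            used |= 1u << id;
        }
        if (i == N_KEYS) {
            for (i = 0; i < N_KEYS; ++i) {
                MyState *slot = &table[toy_search(keys[i], seed)];
                slot->key = keys[i];
                slot->value = (int)i;
            }
            *seed_out = seed;
            return 0;
        }
    }
    return -1;
}

/* Lookup with verification: an unknown key still lands on some slot,
 * so the stored key must be compared before the slot is trusted. */
static const MyState *lookup(const MyState table[N_KEYS], uint64_t seed,
                             uint64_t key)
{
    uint32_t id = toy_search(key, seed);
    if (id >= N_KEYS) return NULL;           /* index out of range */
    const MyState *state = &table[id];
    if (state->key != key) return NULL;      /* slot owned by another key */
    return state;
}
```

Calling lookup() with one of the keys passed to build_table() returns its slot; any other key returns NULL because the stored key does not match. The range check is redundant for this stand-in, which always returns an index below N_KEYS, but it is kept to mirror the check in the question.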
