Hard to tell just by looking at the (pretty much non-existent) documentation of CMPH. Digging into the source code seems simpler. The implementation of the internal hash function used by CMPH can be found in the hash()
function, which ends up calling __jenkins_hash_vector()
. This hash function was originally designed by Robert J. Jenkins Jr. in 1997 and can be found here. As far as the function is concerned, nothing weird happens with the key used by the function, so this hash function can be safely used even for invalid (non present) keys.
The cmph_search()
function calls the correct *_search()
function based on the algorithm you configured (CHD, BDZ, BMZ, and so on). Then hash()
is called and the resulting values are used in different ways depending on the algorithm.
For simpler algorithms such as BMZ, BMZ8 and FCH I can see that the hash is simply used to index an internal array (mphf->data->g
). All the accesses are performed modulo the size of this array (mphf->data->n
) so this looks fine. Just from this, I would say if you are using these algorithms you are safe. For more complex algorithms like BDZ it's a bit harder to understand what is really going on and where/how the calculated hashes are actually used for.
Taking a look at the tests implemented in the library source, (for example at this one), we can see that the author uses a logic similar to yours to detect whether a key is a duplicate or unknown:
/* ... */
cmph_uint32 siz = cmph_size(mphf);
hashtable = (cmph_uint8*)malloc(siz*sizeof(cmph_uint8));
memset(hashtable, 0, (size_t)siz);
//check all keys
for (i = 0; i < source->nkeys; ++i)
{
cmph_uint32 h;
char *buf;
cmph_uint32 buflen = 0;
source->read(source->data, &buf, &buflen);
h = cmph_search(mphf, buf, buflen);
if (!(h < siz))
{
fprintf(stderr, "Unknown key %*s in the input.\n", buflen, buf);
ret = 1;
} else if(hashtable[h])
{
fprintf(stderr, "Duplicated or unknown key %*s in the input\n", buflen, buf);
ret = 1;
} else hashtable[h] = 1;
if (verbosity)
{
printf("%s -> %u\n", buf, h);
}
source->dispose(source->data, buf, buflen);
}
/* ... */
The only thing that's missing from the above code is storing and comparing the keys like you are doing in your example. At the end of the day, it looks to me like calling cmph_search()
with an unknown key is fine. Understanding whether the key is unknown or not given the resulting hash is then the job of the library user.