
I've run into an unexpected situation when trying to hash pointers using the default implementation of robin_hood::unordered_flat_set from https://github.com/martinus/robin-hood-hashing.

My test case looks like the following:

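#include <algorithm>
#include <vector>
#include "robin_hood.h"

void test()
{
    std::vector<int*> v{ /* ~4k entries extracted from a real run */ };
    robin_hood::unordered_flat_set<int*> fs;
    // Insert each pointer into the set.
    std::ranges::for_each(v, [&](int* p) { fs.insert(p); }); // boom!
}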

I assume the default hash function is "good" (the comments in the source indicate that it is taken from MurmurHash3). The diagnostic output shows that the robin_hood implementation throws an overflow error after calling try_increase_info unsuccessfully 5 times.
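
For reference, here is a minimal sketch of the 64-bit finalizer (fmix64) from the public MurmurHash3 reference code, which is what I take those comments to refer to. This is an assumption on my part: it is not copied from robin_hood, whose integer hash may use a different or shortened variant, and the names `fmix64_sketch` and `hash_pointer_sketch` are mine.

```cpp
#include <cstdint>

// The MurmurHash3 64-bit finalizer as published in the reference implementation.
// NOTE: illustrative only; robin_hood's own integer hash may differ.
inline std::uint64_t fmix64_sketch(std::uint64_t k) {
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

// Hypothetical helper: hash a pointer by mixing its address bits.
inline std::uint64_t hash_pointer_sketch(const void* p) {
    return fmix64_sketch(
        static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p)));
}
```

Each step (xor with a right shift, multiplication by an odd constant) is invertible, so distinct addresses never map to the same 64-bit value. Collisions can only appear later, when the table reduces the hash to a slot index.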

I did a quick analysis of the sorted data. All of the values lie between 0x7fc768000000 and 0x7fc788000000. The most common difference between adjacent entries is n*128 bytes (0x80), where n is a small integer. There are larger gaps in the data as well. Of course, I can easily fix the issue by using std::unordered_set<int*>. With it, the maximum bucket size for this data set is 6, which is pretty reasonable.
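
A bucket measurement like the one above can be sketched as follows. The real pointer values are not reproduced here, so `make_synthetic_pointers` is a hypothetical stand-in that generates addresses from the lower bound quoted above with a fixed 128-byte stride; it assumes a 64-bit platform, and the result on synthetic data need not match the figure for the real data.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the real data: addresses starting at the lower
// bound of the observed range, spaced 128 bytes apart. Never dereferenced.
inline std::vector<int*> make_synthetic_pointers(std::size_t count) {
    std::vector<int*> out;
    out.reserve(count);
    const std::uintptr_t base = 0x7fc768000000ULL;  // assumes 64-bit pointers
    for (std::size_t i = 0; i < count; ++i) {
        out.push_back(reinterpret_cast<int*>(base + i * 0x80));
    }
    return out;
}

// Returns the size of the fullest bucket in a std::unordered_set built from v.
inline std::size_t max_bucket_size(const std::vector<int*>& v) {
    std::unordered_set<int*> s(v.begin(), v.end());
    std::size_t worst = 0;
    for (std::size_t b = 0; b < s.bucket_count(); ++b) {
        worst = std::max(worst, s.bucket_size(b));
    }
    return worst;
}
```

Calling `max_bucket_size(make_synthetic_pointers(4000))` gives the worst bucket for the synthetic set; the exact number depends on the standard library's hash and bucket count.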

The whole point of using a non-standard hash container is performance, but I can't use it if there are correctness issues. I can accept switching my code to (yet) another hash implementation, provided there is a stronger guarantee that a relatively straightforward sequence of arbitrary pointer values won't cause an internal error.

The question "Hashing of pointer values" contains some useful information. However, its accepted answer basically states "here are some hash functions that might be of use", which hardly gives me warm fuzzy feelings.

Any advice?

MarkB
  • Often times `sizeof(std::uintptr_t) == sizeof(std::size_t)`, so `std::size_t{reinterpret_cast<std::uintptr_t>(pointer_value)}` is a perfect hash function (it has no collisions and it takes essentially zero cycles to run). It seems like the library you're using might have run into a bug. – Artyer Oct 03 '22 at 18:26
  • @Artyer, that is an orthogonal point. Worst-case behavior (e.g. O(n) for find()) can still occur with a perfect hash function. – MarkB Oct 03 '22 at 18:39
  • Looking at the thing, the byte hash function appears to be MurmurHash64A, the integer hash function (apparently used for pointers) appears to be the Murmurhash3 finalizer, but possibly isn't good as a hash function on its own for your data? – Hasturkun Oct 03 '22 at 18:54
  • @Hasturkun, I'm not really in control of the data since the pointer values are returned by the memory allocator. – MarkB Oct 03 '22 at 19:08
  • Looks like the author no longer is providing updates to the code and has moved their efforts to a new implementation...that is probably telling. – MarkB Oct 04 '22 at 12:55
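
To illustrate the exchange above about identity hashing and worst-case behavior, here is a minimal sketch. It assumes a table that picks a slot by masking the low bits of the hash, which is common in open-addressing tables but has not been confirmed for robin_hood. The function name and all parameters are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Counts how many distinct slots are used when `count` addresses, spaced
// `stride` bytes apart starting at `base`, are hashed with the identity
// function and reduced with a power-of-two mask.
inline std::size_t distinct_slots_identity(std::uint64_t base, std::uint64_t stride,
                                           std::size_t count, std::uint64_t table_size) {
    const std::uint64_t mask = table_size - 1;  // table_size must be a power of two
    std::set<std::uint64_t> slots;
    for (std::size_t i = 0; i < count; ++i) {
        slots.insert((base + i * stride) & mask);
    }
    return slots.size();
}
```

With a base of 0x7fc768000000, a stride of 0x80, 4096 addresses, and 4096 slots, only 32 slots are ever used, because the low 7 bits of every address are zero. A collision-free hash can therefore still pile entries into a few slots.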

0 Answers