
I tried to implement an unordered map for a class called Pair that stores an integer and a bitset. Then I found out that there isn't a hash function for this class. Now I wanted to create my own hash function. But instead of using XOR or comparable functions, I wanted a hash function like the following approach:

The bitsets in my class obviously have a fixed size, so I wanted to do the following:

Example: for an instance of Pair with the bitset<6> = 101101 and the integer 6:

  • create a string = "1011016"
  • use the default hash function on this string
  • because the bitsets have a fixed size, each key would be unique

How could I implement this approach?
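
To make the idea concrete, something like this untested sketch is what I have in mind (hash_pair is just a placeholder name):

#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// concatenate the bitset pattern and the integer into one string,
// then reuse the standard string hash on that string
std::size_t hash_pair(const std::bitset<6>& bits, int value)
{
  std::string key = bits.to_string() + std::to_string(value);
  return std::hash<std::string>{}(key);
}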

thank you in advance

m6rco
  • Does this answer your question? [How to specialize std::hash for user defined types?](https://stackoverflow.com/questions/24361884/how-to-specialize-stdhasht-for-user-defined-types) – lorro Jun 28 '22 at 21:40
  • `'101101', 6` and `'10110', 16` have the same hash. Your hash has predictable collisions and that can be used to attack your code. You should convert the integer to bitset, join them and use the hash function. Or add a separator "101101:6" so every key has a unique string. – Goswin von Brederlow Jun 28 '22 at 21:51
  • @GoswinvonBrederlow 10110 cannot appear as a bitset<6> because the size is just 5. It would be unique, because the first six characters are always from the bitset and the rest is the integer – m6rco Jun 28 '22 at 21:55
  • Your approach would be pretty slow. Why not ```(Integer << 6) | bitset.to_ulong()``` Or if you would like to preserve the upper bits of Integer, ```std::rotl(Integer, 6) ^ bitset.to_ulong()``` – Homer512 Jun 28 '22 at 22:12
  • @Homer512 I'm sorry, but what do you mean by this approach? Could you maybe implement the short hash function? – m6rco Jun 28 '22 at 23:01
  • If your bitset is always size 6 (or anything <= 32) then combine the bitset and the int into a single uint64_t and hash that (or use that as hash even). – Goswin von Brederlow Jun 29 '22 at 02:17

1 Answer


To expand on a comment, as requested:

Converting to a string and then hashing that string would be somewhat slow, or at least slower than it needs to be. A faster approach is to combine the bit patterns, e.g. like this:

#include <bitset>      // std::bitset
#include <cstddef>     // std::size_t
#include <functional>  // std::hash

struct Pair
{
  std::bitset<6> bits;
  int intval;
};

// specialize std::hash for Pair (with C++17 this can be done outside namespace std)
template<>
struct std::hash<Pair>
{
  std::size_t operator()(const Pair& pair) const noexcept
  {
     std::size_t rtrn = static_cast<std::size_t>(pair.intval);
     // shift the integer left and put the bitset pattern into the low bits
     rtrn = (rtrn << pair.bits.size()) | pair.bits.to_ulong();
     return rtrn;
  }
};

This works on two assumptions:

  1. The upper bits of the integer are generally not interesting
  2. The size of the bitset is always small compared to size_t

I think it is a suitable hash function for use in unordered_map. One may argue that it has poor mixing and a very good hash should change many bits if only a few bits in its input change. But that is not required here. unordered_map is generally designed to work with cheap hash functions. For example GCC's hash for builtin types and pointers is just a static- or reinterpret-cast.
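
For completeness, a minimal usage sketch: unordered_map also needs to compare keys for equality, so Pair gets a simple operator== here in addition to the hash specialization from above.

#include <bitset>
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

struct Pair
{
  std::bitset<6> bits;
  int intval;
};

// unordered_map requires equality comparison for its keys
bool operator==(const Pair& lhs, const Pair& rhs) noexcept
{
  return lhs.intval == rhs.intval && lhs.bits == rhs.bits;
}

// the same specialization as above
template<>
struct std::hash<Pair>
{
  std::size_t operator()(const Pair& pair) const noexcept
  {
    std::size_t rtrn = static_cast<std::size_t>(pair.intval);
    return (rtrn << pair.bits.size()) | pair.bits.to_ulong();
  }
};

int main()
{
  std::unordered_map<Pair, std::string> map;
  map[Pair{std::bitset<6>{"101101"}, 6}] = "found it";
  std::cout << map.at(Pair{std::bitset<6>{"101101"}, 6}) << '\n'; // prints "found it"
}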

Possible improvements

We can preserve the upper bits by rotating instead of shifting.

#include <limits>  // std::numeric_limits, in addition to the headers above

template<>
struct std::hash<Pair>
{
  std::size_t operator()(const Pair& pair) const noexcept
  {
     std::size_t rtrn = static_cast<std::size_t>(pair.intval);
     std::size_t intdigits = std::numeric_limits<decltype(pair.intval)>::digits;
     std::size_t bitdigits = pair.bits.size();
     // rotate instead of shift; can be simplified to std::rotl(rtrn, bitdigits) in C++20
     rtrn = (rtrn << bitdigits) | (rtrn >> (intdigits - bitdigits));
     rtrn ^= pair.bits.to_ulong();
     return rtrn;
  }
};
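
For reference, a sketch of the C++20 variant using std::rotl, as mentioned in the comment above (note that this rotates within the full width of size_t rather than the width of the int):

#include <bit>         // std::rotl, C++20
#include <bitset>
#include <cstddef>
#include <functional>

// Pair as defined above
template<>
struct std::hash<Pair>
{
  std::size_t operator()(const Pair& pair) const noexcept
  {
    std::size_t rtrn = static_cast<std::size_t>(pair.intval);
    // rotate the integer left by the width of the bitset, then mix in the bits
    rtrn = std::rotl(rtrn, static_cast<int>(pair.bits.size()));
    return rtrn ^ pair.bits.to_ulong();
  }
};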

Nothing will change for small integers (except some bitflips for small negative ints). But for large integers we still use the whole range of inputs, which might be of interest for pathological cases such as integer series 2^30, 2^30 + 2^29, 2^30 + 2^28, ...

If the size of the bitset may increase, stop doing fancy stuff and just combine the hashes. I wouldn't simply XOR them, though, in order to avoid hash collisions on small integers.

template<>
struct std::hash<Pair>
{
  std::size_t operator()(const Pair& pair) const noexcept
  {
     // std::hash provides a standard specialization for std::bitset
     std::hash<decltype(pair.intval)> ihash;
     std::hash<decltype(pair.bits)> bhash;
     return ihash(pair.intval) * 31 + bhash(pair.bits);
  }
};

I picked the simple polynomial hash approach common in Java. I believe GCC uses the same one internally for string hashing. Someone else may expand on the topic or suggest a better one. 31 is commonly chosen because it is a prime number one off a power of two, so multiplication by it can be computed quickly as `(x << 5) - x`.
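
As a quick sanity check of that identity:

// multiplying by 31 is the same as shifting left by 5 (i.e. multiplying by 32) and subtracting once
static_assert(12345u * 31u == (12345u << 5) - 12345u);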

Homer512
  • Thank you very much for your answer! I just have one question: how did you come up with the return statement in the last code block, or why did you choose the times 31 there? – m6rco Jun 29 '22 at 08:51
  • @m6rco I added a short explanation to the end – Homer512 Jun 29 '22 at 09:08
  • *"not required here. `unordered_map` is generally designed to work with cheap hash functions. For example GCC's hash for builtin types and pointers is just a static- or reinterpret-cast."* - the Standard doesn't specify much about `unordered_map`, and Visual C++ chooses to use power-of-two bucket counts, which effectively bitwise-AND / mask-out low order bits and discard high order bits, so your hash function would be awful there. GCC and clang use power-of-2 bucket counts, so your function would do much better. – Tony Delroy Jul 02 '22 at 23:17
  • Use of identity hashing doesn't always reflect a lot of tolerance for weak hashing... often it means "this is a no-op - so fast - and good enough when the keys aren't collision prone anyway (when folded into the bucket count)". It avoids work it may not need to do. But for collision prone keys, especially under Visual C++, using a decent custom hash will really help. – Tony Delroy Jul 02 '22 at 23:19
  • @TonyDelroy Does VC put extra work into ```std::hash``` or in their collision resolution? I know Python also uses identity hashing + power-of-2 buckets but they use open addressing + decent collision resolution. They also claim that it improves locality of data, e.g. when integers that are off-by-one are stored in neighboring buckets – Homer512 Jul 02 '22 at 23:51
  • Just noticed I typed "GCC and clang use power-of-2 bucket counts" - meant to say prime that time. Anyway - no to VC and "extra work" - the C++ Standard requires separate chaining - if you get collisions, you just have to add them in the chain... can't tune things as much as you can with open addressing, but even then there are always compromises - linear probing is cache friendly and you can - if you want - collapse when erasing instead of using tombstones, but it's collision prone. Quadratic probing and rehashing not so cache friendly. – Tony Delroy Jul 03 '22 at 21:29
  • Advanced non-Standard implementations like Facebook Folly F14, Google abseil etc. use some neat tricks these days, like SIMD instructions to compare against a byte of 16 different hash values at once - so you can quickly skip to the ones that might match. – Tony Delroy Jul 03 '22 at 21:30