Good hash function over C++ unordered_set

Question

I'm looking to implement a hash function over a C++ std::unordered_set<char>. I initially tried using boost::hash_range:

namespace std
{
template<> struct hash<unordered_set<char> >
{
    size_t operator()(const unordered_set<char> &s) const
    {
        // depends on iteration order, which is the problem described below
        return boost::hash_range(begin(s), end(s));
    }
};
}

But then I realised that because the set is unordered, the iteration order isn't stable, and the hash function is thus wrong. What are some better options for me? I guess I could use std::set instead of std::unordered_set, but using an ordered set just because it's easier to hash seems ... wrong.

You could hash the number of elements in the unordered set. Be aware that comparing your unordered sets when resolving a hash [will be very expensive](http://stackoverflow.com/q/10118551/1553090) — paddy, Feb 29 '16 at 04:50
I guess that furthers the case for using a std::set instead. Thanks. — Ambarish Sridharanarayanan, Feb 29 '16 at 04:57
Seems the only other way is to create a temporary copy and sort that. If hashing the unordered_set is an infrequent operation this could be more reasonable I guess... — Jarra McIntyre, Feb 29 '16 at 04:58
You only need 256 bits to track the characters in the set (less if you're after printable 7-bit ASCII): could encode them contiguously in a `struct` of four `uint64_t`, then `boost::hash_combine` the `uint64_t` members. — Tony Delroy, Feb 29 '16 at 04:59
Good point. In that case [`std::bitset`](http://en.cppreference.com/w/cpp/utility/bitset) would be the most straight-forward, as there is already a [hash specialization](http://en.cppreference.com/w/cpp/utility/bitset/hash). — paddy, Feb 29 '16 at 05:02
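
The bitset idea from the last two comments could look roughly like the sketch below. The functor name `CharSetHash` is made up for illustration, and the code assumes `char` is 8 bits wide, so 256 bits cover every possible value.

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <unordered_set>

// Hypothetical functor (the name is illustrative, not from the thread).
// Assumes 8-bit char, so 256 bits cover every possible element.
struct CharSetHash
{
    std::size_t operator()(const std::unordered_set<char> &s) const
    {
        std::bitset<256> bits;
        for (char c : s)
            bits.set(static_cast<unsigned char>(c)); // same bit whatever the iteration order
        return std::hash<std::bitset<256>>{}(bits);  // standard specialization for std::bitset
    }
};
```

It would be passed as the hasher of the outer container, for example `std::unordered_set<std::unordered_set<char>, CharSetHash>`. Equality still falls back to `operator==` on the inner sets.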

score 3 · Answer 1 · answered Feb 29 '16 at 05:35

You could try simply adding the elements, which is independent of order, and returning the hash of that sum:

template<> struct hash<unordered_set<char> >
{
    size_t operator()(const unordered_set<char> &s) const {
        long long sum{0};
        for ( auto e : s )
            sum += e;   // addition is commutative, so iteration order does not matter
        return std::hash<long long>{}(sum);
    }
};

answered Feb 29 '16 at 05:35

Paul Evans


Also, it is better to use a slightly more complex function than the sum of all values. For instance, the sum of squares of the values usually produces fewer collisions (just compare the sets (1,1,1), (0,1,2), (0,0,3): their sums are all equal, but their sums of squares differ). Anyway, it depends on the type of data. But I'd recommend using something like this. – Ilya Feb 29 '16 at 05:42
You're making a lot of collisions. It should be the binary XOR of hash(element i) for every i, instead of hash(sum of element i). – DU Jiaen Feb 29 '16 at 06:56
@DUJiaen I thought about that but remember we're dealing with `char` here and I chose a `long long` sum over an 8 bit XOR. – Paul Evans Feb 29 '16 at 13:39
PaulEvans: it's still awful - for example, even for sets with only 3 letters, there are 78 different letter combinations totalling each of 327, 328, 329, and 330, and the sums only range from 294 to 363. For sum 329 for example: `anz aoy apx aqw arv asu bmz bny box bpw bqv bru bst clz cmy cnx cow cpv cqu crt dkz dly dmx dnw dov dpu dqt drs ejz eky elx emw env eou ept eqs fiz fjy fkx flw fmv fnu fot fps fqr ghz giy gjx gkw glv gmu gnt gos gpr hix hjw hkv hlu hmt hns hor hpq ijv iku ilt ims inr ioq jkt jls jmr jnq jop klr kmq knp lmp lno`. Note DU Jiaen suggested XOR of **hash**(element). – Tony Delroy Mar 02 '16 at 00:03
@TonyD Of course it depends on the sample space. I was thinking more along the lines of much larger sets with more diverse elements that would fill up a `long long` with different additions. For your case, XOR is much better. – Paul Evans Mar 02 '16 at 00:09
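
The collision count quoted in the comments can be reproduced with a brute-force check. This is only an illustrative sketch: the helper name is hypothetical, and it assumes an ASCII-compatible encoding in which 'a' to 'z' are contiguous. There are 2600 three-letter sets of distinct lowercase letters but only 70 possible sums (294 to 363), so a plain sum has to collide heavily on this kind of input.

```cpp
#include <cstddef>

// Hypothetical helper: counts the 3-letter sets {a, b, c} of distinct
// lowercase letters whose character codes add up to `target`.
// Assumes 'a'..'z' are contiguous (true for ASCII).
std::size_t count_three_letter_sets_with_sum(int target)
{
    std::size_t count = 0;
    for (int a = 'a'; a <= 'z'; ++a)
        for (int b = a + 1; b <= 'z'; ++b)
            for (int c = b + 1; c <= 'z'; ++c)
                if (a + b + c == target)
                    ++count; // another set that a plain sum cannot tell apart
    return count;
}
```

Calling `count_three_letter_sets_with_sum(329)` counts exactly the combinations listed in the comment above.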

score 3 · Accepted Answer · edited May 23 '17 at 12:16

A very similar question, albeit in C#, was asked here:

Hash function on list independant of order of items in it

Over there, Per gave a nice language-independent answer that should put you on the right track. In short, for the input

x₁, …, x_n

you should map it to

f(x₁) op … op f(x_n)

where

f is a good hash function for single elements (integer in your case)
op is a commutative operator, such as xor or plus

Hashing an integer may seem pointless at first, but your goal is to make two neighboring integers dissimilar from each other, so that combining them with op does not produce the same result. For example, if you use + as the operator, you want f(1)+f(2) to give a different result than f(0)+f(3).

If standard hashing functions are not good candidates for f and you cannot find one, check the linked answer for more details...
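
As a rough sketch of the f(x₁) op … op f(x_n) scheme in C++: the code below uses the SplitMix64 finalizer as f and unsigned addition as op. Both choices are assumptions made for illustration, as are the names `mix` and `UnorderedCharSetHash`; any good integer hash and any commutative operator would fit the description above. `std::hash<char>` is avoided as f because implementations are allowed to make it the identity, which would reduce this to the plain sum.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// f: mixes one element so that neighbouring values end up far apart.
// SplitMix64 finalizer, used here as a stand-in for "a good hash function
// for single elements"; this choice is an assumption, not from the answer.
inline std::uint64_t mix(std::uint64_t x)
{
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Hypothetical functor name. op is unsigned addition: it is commutative
// and wraps on overflow, so the result does not depend on iteration order.
struct UnorderedCharSetHash
{
    std::size_t operator()(const std::unordered_set<char> &s) const
    {
        std::uint64_t h = 0;
        for (char c : s)
            h += mix(static_cast<unsigned char>(c));
        return static_cast<std::size_t>(h);
    }
};
```

XOR would work equally well as op here, since a set never contains the same element twice. With a multiset, XOR would cancel repeated elements, which is one reason to prefer addition in that case.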
