C++ Hash Restriction

Question

The probability of h(a)==h(b) for a!=b should approach 1.0/std::numeric_limits<std::size_t>::max().

I want to create a hash table of pairs (a, b), where (a, b) == (b, a) (unordered pair), so my hash function is:

struct hash_pair {
  template<class T>
  std::size_t operator()(std::pair<T, T> const& p) const
  {
     std::hash<T> h;
     return std::hash<std::size_t>{}(h(p.first) + h(p.second));
  }
};

Assuming that h(ti) and std::hash<std::size_t> fulfill the requirement, will hash_pair fulfill it as well?

After further thinking:

(some extra details)

p.first != p.second by precondition of my use case.
T will be std::size_t in the majority of cases, whose hash value is itself in common implementations (the standard does not guarantee this), so h(n) == n and thus P(h(n1) == h(n2)) when n1 != n2 is 0.
Since the sum is commutative, hash(pair(n1, n2)) == hash(pair(n2, n1)), which is intended.
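As a minimal sketch of the symmetry described above, the hasher can be paired with an order-insensitive equality predicate in a std::unordered_set. The names unordered_pair_equal, symmetric_hash and insert_both_orders are hypothetical helpers introduced only for this illustration; they are not part of the question.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_set>
#include <utility>

// Same idea as hash_pair in the question: hash the commutative sum of the element hashes.
struct hash_pair {
  template <class T>
  std::size_t operator()(std::pair<T, T> const& p) const {
    std::hash<T> h;
    // Unsigned addition wraps modulo 2^N, so the sum is always well defined.
    return std::hash<std::size_t>{}(h(p.first) + h(p.second));
  }
};

// Hypothetical helper (not in the question): equality that ignores element order.
struct unordered_pair_equal {
  template <class T>
  bool operator()(std::pair<T, T> const& a, std::pair<T, T> const& b) const {
    return (a.first == b.first && a.second == b.second) ||
           (a.first == b.second && a.second == b.first);
  }
};

using pair_t = std::pair<std::size_t, std::size_t>;
using pair_set = std::unordered_set<pair_t, hash_pair, unordered_pair_equal>;

// Returns true when (a, b) and (b, a) produce the same hash value.
inline bool symmetric_hash(std::size_t a, std::size_t b) {
  hash_pair h;
  return h(pair_t{a, b}) == h(pair_t{b, a});
}

// Inserts (a, b) and then (b, a) into an empty set and returns the resulting size.
inline std::size_t insert_both_orders(std::size_t a, std::size_t b) {
  pair_set s;
  s.insert(pair_t{a, b});
  s.insert(pair_t{b, a});
  return s.size();
}
```

Calling insert_both_orders(3, 7) should leave a single element in the set, because both orderings hash identically and compare equal under the predicate.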

So there are only two cases in which two pairs can be different: when they have exactly one element in common, or when they have none:

 P1 = P(n1 + n2 == n1 + n3) = P(n2 == n3) = 0 // Because n2 != n3
 P2 = P(n1 + n2 == n3 + n4) = ? // n1 != n3 and n2 != n4

So my problem reduces to calculating P(none_in_common) * P(n1 + n2 == n3 + n4). P(none_in_common) is use-case specific (this probability will probably be high in my case), but what about P2? Any help here?

NOTE: My question is not a duplicate of other similar questions around here, because I'm asking about the statistical properties of my proposed hash function, not about how to implement it.

Related: [How to std::hash an unordered std::pair](https://stackoverflow.com/q/28367913/1782792). — jdehesa, Jul 30 '19 at 19:13
Possible duplicate of [How to std::hash an unordered std::pair](https://stackoverflow.com/questions/28367913/how-to-stdhash-an-unordered-stdpair) — Andriy Tylychko, Jul 30 '19 at 23:48

ABu · Answer 1 · 2019-07-30T22:22:25.033


It doesn't fulfill the property, because the final probability calculation has nothing to do with the hash probability. As I understand it, it must be calculated independently, and you cannot apply any algebraic properties to it.

The probability of four different numbers giving the same hash, taken from another question of mine that uses a more mathematical approach, is (n is the domain of each number, i.e. each number ranges from 0 to n):

 (2 * n^2 + 4 * n + 3) / (3 * (n + 1) ^ 3)

For a 64-bit std::size_t this gives approximately 3.61e-20, which is more than good enough: it is about 1.5 times smaller than the roughly 5.42e-20 (1/n) collision probability of hashing a single number, and negligible either way. That must be multiplied by the probability of having two pairs of totally different numbers.
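The closed form can be sanity-checked by brute force on tiny domains. The sketch below assumes the formula counts all tuples (a, b, c, d) with each value in [0, n] and a + b == c + d under plain, non-wrapping addition, which is my reading of it rather than something the answer states. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Counts tuples (a, b, c, d), each value in [0, n], with a + b == c + d.
// Plain integer addition is used; n is kept tiny, so nothing wraps around.
inline std::uint64_t count_equal_sums(std::uint64_t n) {
  std::uint64_t count = 0;
  for (std::uint64_t a = 0; a <= n; ++a)
    for (std::uint64_t b = 0; b <= n; ++b)
      for (std::uint64_t c = 0; c <= n; ++c)
        for (std::uint64_t d = 0; d <= n; ++d)
          if (a + b == c + d) ++count;
  return count;
}

// The closed form multiplied by the (n + 1)^4 possible tuples:
// (n + 1) * (2n^2 + 4n + 3) / 3, which is always an integer.
inline std::uint64_t closed_form_count(std::uint64_t n) {
  return (n + 1) * (2 * n * n + 4 * n + 3) / 3;
}

// The closed form itself, evaluated in floating point for large n.
inline double closed_form_probability(double n) {
  return (2.0 * n * n + 4.0 * n + 3.0) / (3.0 * std::pow(n + 1.0, 3));
}
```

For small n the exhaustive count and the closed form agree, and evaluating closed_form_probability at 2^64 - 1 reproduces the 3.61e-20 figure.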

NOTE: I was wrong in my first sentence. Because unsigned addition wraps around (modular arithmetic), the sum of the hashes is uniformly distributed if the hash function itself is.
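The wrap-around argument can be illustrated exhaustively on a small unsigned type. This is only a sketch: it uses 8-bit values as a stand-in for std::size_t and assumes the two hash values are independent and uniform. The function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// For every 8-bit value s, counts the pairs (a, b) with (a + b) mod 256 == s.
inline std::array<std::uint32_t, 256> wrapped_sum_histogram() {
  std::array<std::uint32_t, 256> hist{};
  for (unsigned a = 0; a < 256; ++a)
    for (unsigned b = 0; b < 256; ++b)
      ++hist[static_cast<std::uint8_t>(a + b)];  // the conversion wraps modulo 256
  return hist;
}

// Returns true when every wrapped sum occurs exactly as often as any other.
inline bool wrapped_sum_is_uniform() {
  for (std::uint32_t c : wrapped_sum_histogram())
    if (c != 256) return false;
  return true;
}
```

Each of the 256 possible sums is produced by exactly 256 of the 65536 input pairs. Under the same assumptions, two independent wrapped sums would collide with probability 1/256 here, and by the same reasoning 1/2^N for an N-bit std::size_t.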

edited Jul 30 '19 at 22:22

answered Jul 30 '19 at 21:54

ABu



If `h(a)` is evenly distributed over `std::size_t`, and `h(b)` is also evenly distributed over `std::size_t`, then `h(a)+h(b)` is also evenly distributed over `std::size_t` _due to rollover_. http://coliru.stacked-crooked.com/a/40e05275b3a753fb Without rollover, you'd be correct. – Mooing Duck Jul 30 '19 at 22:09
@MooingDuck Interesting stuff. Thanks. – ABu Jul 30 '19 at 22:18
