Hash value for a std::unordered_map

Question

According to the standard there's no support for containers (let alone unordered ones) in the std::hash class. So I wonder how to implement that. What I have is:

std::unordered_map<std::wstring, std::wstring> _properties;
std::wstring _class;

I thought about iterating the entries, computing the individual hashes for keys and values (via std::hash<std::wstring>) and concatenating the results somehow.

What would be a good way to do that and does it matter if the order in the map is not defined?

Note: I don't want to use boost.

A simple XOR was suggested, so it would be like this:

size_t MyClass::GetHashCode()
{
  std::hash<std::wstring> stringHash;
  size_t mapHash = 0;
  // Iterate by const reference to avoid copying each key-value pair.
  for (const auto& property : _properties)
    mapHash ^= stringHash(property.first) ^ stringHash(property.second);

  return ((_class.empty() ? 0 : stringHash(_class)) * 397) ^ mapHash;
}

?

I'm really unsure if that simple XOR is enough.

`s/concatenate/XOR` and you should be good to go. The only things a hash function must be able to do are to generate the same hash for two semantically equivalent values and to distribute its output reasonably evenly over the set of all possible hash values. — The Paramagnetic Croissant, Jun 28 '15 at 10:09
Basically your question is how to get a hash for a (unordered) range of values and actually is not specific to `std::unordered_map`? — Stephan Dollberg, Jun 28 '15 at 10:30
"is enough" what do you mean? How do you define "enough"? No collisions at all? — BartoszKP, Jun 28 '15 at 10:30
Well, "enough" means here that it satisfies the conditions for a hash function same as defined for std::hash: http://en.cppreference.com/w/cpp/utility/hash. — Mike Lischke, Jun 28 '15 at 11:07

score 9 · Accepted Answer · edited Jun 20 '20 at 09:12

Response

If by "enough" you mean whether your function is injective, the answer is no. The set of all hash values your function can output has cardinality at most 2^64 (assuming a 64-bit std::size_t), while the space of your inputs is much larger. This is not really important, though, because no hash function can be injective given the nature of your inputs. A good hash function has these qualities:

It's not easily invertible. Given the output k, it's not computationally feasible within the lifetime of the universe to find m such that h(m) = k.
Its outputs are uniformly distributed over the output space.
It's hard to find two distinct inputs m and m' such that h(m) = h(m').

Of course, how far these need to hold depends on whether you want something cryptographically secure or just want to map some arbitrary chunk of data to some arbitrary 64-bit integer. If you want something cryptographically secure, writing it yourself is not a good idea. In that case, you'd also need the guarantee that the function is sensitive to small changes in the input. The std::hash function object is not required to be cryptographically secure. It exists for use cases isomorphic to hash tables. cppreference says:

For two different parameters k1 and k2 that are not equal, the probability that std::hash<Key>()(k1) == std::hash<Key>()(k2) should be very small, approaching 1.0/std::numeric_limits<size_t>::max().

I'll show below how your current solution doesn't really guarantee this.

Collisions

I'll give you a few of my observations on a variant of your solution (I don't know what your _class member is).

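#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// XOR together the hashes of every key and every value in the map.
std::size_t hash_code(const std::unordered_map<std::string, std::string>& m) {
    std::hash<std::string> h;
    std::size_t result = 0;
    for (auto&& p : m) {
        result ^= h(p.first) ^ h(p.second);
    }
    return result;
}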

It's easy to generate collisions. Consider the following maps:

std::unordered_map<std::string, std::string> container0;
std::unordered_map<std::string, std::string> container1;
container0["123"] = "456";
container1["456"] = "123";
std::cout << hash_code(container0) << '\n';
std::cout << hash_code(container1) << '\n';

On my machine, compiling with g++ 4.9.1, this outputs:

1225586629984767119
1225586629984767119

Whether this matters depends on how often you're going to have maps where keys and values are reversed. More generally, these collisions occur between any two maps that contain the same strings overall, regardless of which of them are keys and which are values.
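The collision is structural rather than a quirk of the strings "123" and "456". A minimal sketch, assuming the XOR-only combiner shown above (renamed xor_hash_code here, a hypothetical name, as is demonstrate_collisions), of three cases that collide for any std::hash implementation: swapped key and value, values regrouped under different keys, and an entry whose key equals its value.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

using StringMap = std::unordered_map<std::string, std::string>;

// Same logic as the XOR-only hash_code above; the name is illustrative.
std::size_t xor_hash_code(const StringMap& m) {
    std::hash<std::string> h;
    std::size_t result = 0;
    for (const auto& p : m) {
        result ^= h(p.first) ^ h(p.second);
    }
    return result;
}

// Each assertion holds for any std::hash, because XOR cannot tell
// keys from values and x ^ x == 0.
void demonstrate_collisions() {
    // 1. Key and value swapped.
    StringMap a{{"123", "456"}};
    StringMap b{{"456", "123"}};
    assert(xor_hash_code(a) == xor_hash_code(b));

    // 2. Same keys and values, paired differently.
    StringMap c{{"k1", "v1"}, {"k2", "v2"}};
    StringMap d{{"k1", "v2"}, {"k2", "v1"}};
    assert(xor_hash_code(c) == xor_hash_code(d));

    // 3. An entry with key == value contributes nothing at all.
    StringMap e{{"abc", "abc"}};
    StringMap empty;
    assert(xor_hash_code(e) == xor_hash_code(empty));
}
```

The third case is probably the most surprising in practice: a map of strings to themselves hashes to 0 under this scheme, no matter how many entries it has.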

Order of Iteration

Two unordered_map instances having exactly the same key-value pairs will not necessarily have the same order of iteration. cppreference says:

For two parameters k1 and k2 that are equal, std::hash<Key>()(k1) == std::hash<Key>()(k2).

This is a trivial requirement for a hash function, but an order-dependent combination could violate it for equal maps. Your solution satisfies it because XOR is commutative and associative, so the order of iteration doesn't affect the result.
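As a rough illustration of the order argument, the sketch below builds two equal maps with different insertion orders and different bucket counts, then checks that an XOR-based combiner gives the same result for both. The names order_free_hash and check_order_independence are made up for this example, and whether the two maps really iterate in different orders depends on the standard library implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

using StringMap = std::unordered_map<std::string, std::string>;

// XOR-combine per-entry hashes; the result cannot depend on iteration order.
std::size_t order_free_hash(const StringMap& m) {
    std::hash<std::string> h;
    std::size_t result = 0;
    for (const auto& p : m) {
        result ^= h(p.first) ^ h(p.second);
    }
    return result;
}

void check_order_independence() {
    StringMap a;
    a["x"] = "1";
    a["y"] = "2";
    a["z"] = "3";

    StringMap b;
    b.reserve(1024);  // different bucket count, so iteration order may differ
    b["z"] = "3";
    b["y"] = "2";
    b["x"] = "1";

    // Equal maps must hash equally, whatever order they iterate in.
    assert(a == b);
    assert(order_free_hash(a) == order_free_hash(b));
}
```

An order-dependent combiner, such as folding with result = result * 31 + entry_hash, would not offer this guarantee for unordered containers.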

A Possible Solution

If you don't need something that's cryptographically secure, you can modify your solution slightly to kill the symmetry. This approach is okay in practice for hash tables and the like. This solution is also independent of the fact that order in an unordered_map is undefined. It uses the same property your solution used (Commutativity of XOR).

std::size_t hash_code(const std::unordered_map<std::string, std::string>& m) {
    const std::size_t prime = 19937;
    std::hash<std::string> h;
    std::size_t result = 0;
    for (auto&& p : m) {
        result ^= prime*h(p.first) + h(p.second);
    }
    return result;
}

All you need in a hash function in this case is a way to map a key-value pair to an arbitrary good hash value, and a way to combine the hashes of the key-value pairs using a commutative operation. That way, order does not matter. In the example hash_code I wrote, the key-value pair hash value is just a linear combination of the hash of the key and the hash of the value. Because the key hash is scaled and the value hash is not, swapping a key with its value generally changes the result. You can construct something a bit more intricate, but there's no need for that.
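To connect this back to the question, here is one way the same idea might look for a class holding a std::wstring name and a std::wstring map. This is only a sketch: the struct layout, the member names class_name and properties, and the helper check_my_class_hash are assumptions standing in for the asker's MyClass, the multiplier 397 is taken from the question, and 19937 from the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for the class in the question.
struct MyClass {
    std::wstring class_name;
    std::unordered_map<std::wstring, std::wstring> properties;

    bool operator==(const MyClass& other) const {
        return class_name == other.class_name && properties == other.properties;
    }
};

namespace std {
// Specializing std::hash for a user-defined type is permitted.
template <>
struct hash<MyClass> {
    std::size_t operator()(const MyClass& v) const {
        const std::size_t prime = 19937;
        std::hash<std::wstring> h;
        std::size_t map_hash = 0;
        for (const auto& p : v.properties) {
            // Asymmetric per-entry hash, combined with commutative XOR.
            map_hash ^= prime * h(p.first) + h(p.second);
        }
        return (h(v.class_name) * 397) ^ map_hash;
    }
};
}  // namespace std

void check_my_class_hash() {
    MyClass a{L"Widget", {{L"color", L"red"}, {L"size", L"10"}}};
    MyClass b{L"Widget", {{L"size", L"10"}, {L"color", L"red"}}};

    // Equal objects must produce equal hashes.
    assert(a == b);
    assert(std::hash<MyClass>{}(a) == std::hash<MyClass>{}(b));

    // With the specialization in place, the type works as a set key.
    std::unordered_set<MyClass> seen;
    seen.insert(a);
    assert(seen.count(b) == 1);
}
```

Putting the logic in a std::hash specialization rather than a GetHashCode member is a design choice made for the sketch; it lets the type be used directly as a key in std::unordered_set or std::unordered_map.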

Aha, that's close to what I expected. "base" is probably a prime number and arbitrary, right? Of course this is not for any type of cryptographic support. I assumed that would implicitly clear from the use of std::hash. — Mike Lischke, Jun 28 '15 at 11:22
Yes, I chose 19937 because 2^19937 - 1 is my favorite Mersenne prime. — user123, Jun 28 '15 at 11:23
I may be confused, but couldn't this give you two distinct hash values for two equal maps if they weren't being iterated in the same order? (i.e. isn't this hash order dependent?) — Hasturkun, Jun 28 '15 at 11:46
@MikeLischke Have a look at the updated answer, I found that the key-value hash combining should be commutative. — user123, Jun 28 '15 at 12:11
Nice and comprehensive answer but I think that the first section is a little misleading. To my knowledge, the C++ standard never claims `std::hash` should be a cryptographic hash function, so if you write your own container hash based upon `std::hash`, you wouldn't expect that to be cryptographically secure either. For its intended use as key generator for hash tables, such security isn't needed either and wouldn't warrant the additional cost. However, your last bullet point is relevant in defeating DOS attacks. — 5gon12eder, Jun 28 '15 at 12:37
@5gon12eder The fact that this was written in between question edits imposes the need for a small reformulation :/ (I'll address that right away) — user123, Jun 28 '15 at 12:49
Fair point, I think the edit makes it clearer. You already got my up-vote earlier. ;-) — 5gon12eder, Jun 28 '15 at 12:58
