
I am using an std::unordered_map in C++11. I am deciding between string keys and a compound data type (like two longs put together in a struct to hold a UUID).

Is there an easy way to determine the performance characteristics of lookups, inserts, removes etc when the hashmap is using std::string keys versus when the hashmap is using some other simple data type for keys?

Once I've selected a data type, a second question arises: std::unordered_map's search, removal and insert operations are constant time in the number of elements in the map, but if I have a very long key (say, 128 bits), I start to wonder about the performance of these operations in the size of the key.

Is this something to be concerned about, or will the difference be negligible?

skyw

1 Answer


I think you've misunderstood the complexity guarantees of std::unordered_map's insert, removal and find operations. The O(size()) worst case you may have seen documented only happens if you supply a terrible hash function for the Key type, one that generates lots of collisions for keys that are distinct (i.e. do not compare equal).

Say you have

#include <cstddef>
#include <unordered_map>

struct foo {};  // placeholder mapped type

// Pathological hasher: every key lands in the same bucket.
struct terrible_hash
{
  std::size_t operator()(int) const { return 42; }
};

std::unordered_map<int, foo, terrible_hash> m;

All insertions of new keys into the map above will be O(m.size()), because every key hashes to the same bucket and the container must compare the new key linearly against each existing element.

Given a decent hash function, those operations should be (amortized) constant time.
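For contrast, simply dropping the custom hasher and letting the map fall back to the default std::hash<int> restores that behavior:

std::unordered_map<int, foo> good_m;  // default std::hash<int>: amortized O(1) ops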

Going back to your question of string vs. a 128-bit number (UUID) as the key type: it depends on your implementation, but typically the latter should be quicker. I say this based on the following assumptions:

  • Typical hash<string> specializations iterate over the entire string, folding each byte into the running result with bitwise math, so the cost of hashing grows with the string's length. For instance, here is a partial/simplified implementation taken from VS2013 (the FNV-1a algorithm):

    size_t _Val = 14695981039346656037ULL;  // FNV-1a offset basis (64-bit)
    for (size_t _Next = 0; _Next < _Count; ++_Next)
    {
      _Val ^= (size_t)_First[_Next];  // fold in one byte...
      _Val *= 1099511628211ULL;       // ...then multiply by the FNV prime
    }
    return _Val;
    
  • With your 128-bit key type, you should be able to combine the two 64-bit words into a hash with far fewer operations. For example, you could define a helper function template and use it to combine the hashes of the two 64-bit words (a full sketch follows this list).

    #include <cstddef>     // std::size_t
    #include <functional>  // std::hash

    // Mix the hash of v into seed (same scheme as boost::hash_combine).
    template <class T>
    inline void hash_combine(std::size_t& seed, const T& v)
    {
        std::hash<T> hasher;
        seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    

    The magic numbers are stolen from boost::hash_combine. Again, looking at the MSVC implementation of std::hash<uint64_t>, it aliases into the 64-bit integer via an unsigned char * and runs the byte-wise algorithm I pasted above, but in this case the number of iterations is known at compile time, so the compiler can optimize better.
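Putting it together, here is a minimal sketch of how the helper could be wired up. The uuid_key struct name and its two-word layout are illustrative stand-ins for your actual type:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical 128-bit key: two 64-bit words (stand-in for a real UUID type).
struct uuid_key
{
    std::uint64_t hi;
    std::uint64_t lo;

    bool operator==(const uuid_key& other) const
    { return hi == other.hi && lo == other.lo; }
};

template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Specializing std::hash lets std::unordered_map pick up the hasher
// automatically, without a third template argument.
namespace std
{
    template <>
    struct hash<uuid_key>
    {
        size_t operator()(const uuid_key& k) const
        {
            size_t seed = 0;
            hash_combine(seed, k.hi);  // hash the high word
            hash_combine(seed, k.lo);  // mix in the low word
            return seed;
        }
    };
}

std::unordered_map<uuid_key, std::string> uuid_map;

An equally valid alternative is to pass the hasher as the map's third template argument, as with terrible_hash above; the specialization just keeps the map declaration tidy.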

Having said all that, if performance is very important, you need to measure both key choices and then make a decision.
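As a starting point, a rough micro-benchmark along these lines can give a first impression. Everything here is illustrative: the key count and contents are arbitrary, and a plain 64-bit integer stands in for the struct key (substitute your real key type and hasher). Your actual key distribution and a proper benchmarking framework will give more trustworthy numbers:

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Insert every key, then look every key back up, and report elapsed time.
template <class Map, class Keys>
long long time_map_ops(Map& m, const Keys& keys)
{
    using clock = std::chrono::steady_clock;
    auto start = clock::now();
    for (const auto& k : keys)
        m[k] = 0;
    std::size_t hits = 0;
    for (const auto& k : keys)
        hits += m.count(k);
    auto stop = clock::now();
    if (hits != keys.size())
        std::cerr << "lookup mismatch\n";  // sanity check
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

int main()
{
    const std::size_t n = 1000000;

    std::vector<std::string> string_keys;
    std::vector<std::uint64_t> int_keys;  // stand-in for the 128-bit struct
    for (std::size_t i = 0; i < n; ++i)
    {
        string_keys.push_back("key-" + std::to_string(i));
        int_keys.push_back(i);
    }

    std::unordered_map<std::string, int> sm;
    std::unordered_map<std::uint64_t, int> im;
    std::cout << "string keys:  " << time_map_ops(sm, string_keys) << " us\n";
    std::cout << "integer keys: " << time_map_ops(im, int_keys) << " us\n";
}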

Praetorian