
I've been watching this video from CppCon 2014 and discovered that there is an interface to access buckets underneath std::unordered_map. Now I have a couple of questions:

  • Are there any reasonable examples of the usage of this interface?
  • Why did the committee decide to define this interface? Why wasn't the typical STL container interface enough?
πάντα ῥεῖ
Kostya
  • _"and discovered that there is an interface to access buckets underneath `std::unordered_map`."_ Please elaborate about this probably implementation specific detail in your question. – πάντα ῥεῖ Jun 29 '15 at 21:00
  • @πάνταῥεῖ naw, it's not implementation-specific. The standard interface of `std::unordered_map` exposes this implementation detail. It's horrible. – The Paramagnetic Croissant Jun 29 '15 at 21:01
  • @πάνταῥεῖ: it's part of the [standard library](http://en.cppreference.com/w/cpp/container/unordered_map/begin2). – Kerrek SB Jun 29 '15 at 21:02
  • @KerrekSB Though the question should be self contained, shouldn't it? – πάντα ῥεῖ Jun 29 '15 at 21:04
  • @TheParamagneticCroissant: what's so horrible? It requires a hashing function, it has amortized O(1) insert, average O(1) search... it's not like it's a mystery that it has to be implemented as a hash table, and it's useful to be able to exploit all its properties (compare with e.g. `std::priority_queue`, that substantially mandates a specific implementation but has a ridiculously limited interface, which makes it almost useless for any concrete usage). – Matteo Italia Jun 29 '15 at 21:35
  • @MatteoItalia Constant factors. The way its interface is formulated in the Standard means that effectively it must use linked lists for collision handling, which is one of the slowest ways to implement a hash table. The standard should have allowed greater flexibility so that library providers have the opportunity to make it as fast as they desire. – The Paramagnetic Croissant Jun 29 '15 at 21:50
  • @TheParamagneticCroissant ok, so the problem is not that they specified a bucket interface, but that they specified one that mandates an inefficient implementation. – Matteo Italia Jun 29 '15 at 21:54

4 Answers


It is often enlightening to search for the proposal that introduced an item, as there is often an accompanying rationale. In this case N1443 says this:

G. Bucket Interface

Like all standard containers, each of the hashed containers has member function begin() and end(). The range [c.begin(), c.end()) contains all of the elements in the container, presented as a flat range. Elements within a bucket are adjacent, but the iterator interface presents no information about where one bucket ends and the next begins.

It's also useful to expose the bucket structure, for two reasons. First, it lets users investigate how well their hash function performs: it lets them test how evenly elements are distributed within buckets, and to look at the elements within a bucket to see if they have any common properties. Second, if the iterators have an underlying segmented structure (as they do in existing singly linked list implementations), algorithms that exploit that structure, with an explicit nested loop, can be more efficient than algorithms that view the elements as a flat range.

The most important part of the bucket interface is an overloading of begin() and end(). If n is an integer, [begin(n), end(n)) is a range of iterators pointing to the elements in the nth bucket. These member functions return iterators, of course, but not of type X::iterator or X::const_iterator. Instead they return iterators of type X::local_iterator or X::const_local_iterator. A local iterator is able to iterate within a bucket, but not necessarily between buckets; in some implementations it's possible for X::local_iterator to be a simpler data structure than X::iterator. X::iterator and X::local_iterator are permitted to be the same type; implementations that use doubly linked lists will probably take advantage of that freedom.

This bucket interface is not provided by the SGI, Dinkumware, or Metrowerks implementations. It is inspired partly by the Metrowerks collision-detection interface, and partly by earlier work (see [Austern 1998]) on algorithms for segmented containers.
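
As a minimal sketch of the `begin(n)` / `end(n)` overloads described in the quoted rationale, the nested loop below walks every bucket with local iterators and records which keys it holds. The function name `keys_by_bucket` is only illustrative and is not part of the proposal or the standard library.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: collects the keys stored in each bucket by using
// the local-iterator overloads begin(n) / end(n).
std::vector<std::vector<std::string>>
keys_by_bucket(const std::unordered_map<std::string, int>& m) {
    std::vector<std::vector<std::string>> out(m.bucket_count());
    for (std::size_t n = 0; n < m.bucket_count(); ++n) {
        // On a const map, begin(n) and end(n) return const_local_iterator.
        for (auto it = m.begin(n); it != m.end(n); ++it) {
            out[n].push_back(it->first);
        }
    }
    return out;
}
```

The outer loop runs over bucket indices and the inner loop stays inside one bucket, which is the "explicit nested loop" structure the rationale mentions.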

Howard Hinnant

I imagine you can benefit greatly from this if you're in a high-performance situation and collisions end up killing you. Iterating over the buckets and checking the bucket sizes periodically could tell you whether your hashing policy is good enough.

Unordered maps are greatly dependent on their hashing policy when it comes to performance.
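
A rough sketch of such a check, using only `bucket_count()`, `bucket_size(n)`, and `load_factor()` from the standard interface. The `BucketStats` struct, the `bucket_stats` function, and the deliberately bad `ConstantHash` are hypothetical names introduced here for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>

// Hypothetical summary of how elements are spread over the buckets.
struct BucketStats {
    std::size_t buckets;   // bucket_count()
    std::size_t empty;     // number of buckets holding no element
    std::size_t longest;   // largest bucket_size(n) seen
    float load_factor;     // size() / bucket_count()
};

template <class Map>
BucketStats bucket_stats(const Map& m) {
    BucketStats s{m.bucket_count(), 0, 0, m.load_factor()};
    for (std::size_t n = 0; n < m.bucket_count(); ++n) {
        const std::size_t len = m.bucket_size(n);
        if (len == 0) ++s.empty;
        s.longest = std::max(s.longest, len);
    }
    return s;
}

// Deliberately bad hash for comparison: every key lands in the same bucket.
struct ConstantHash {
    std::size_t operator()(int) const { return 0; }
};
```

With `ConstantHash`, `longest` equals the number of elements and all other buckets are empty; a `longest` value far above the load factor is a hint that the hash function deserves a closer look.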

NG.
  • If performance is a concern, then you shouldn't use `unordered_map` in the first place. – The Paramagnetic Croissant Jun 29 '15 at 21:08
  • @TheParamagneticCroissant Pretty much the only reason `unordered_map` exists is to provide performance advantages over ordered maps. – David Schwartz Jun 29 '15 at 21:22
  • @DavidSchwartz in which it fails helplessly because of its horrible allocate-memory-for-every-node nature (not to mention cache locality…). It's almost trivial to write a hash table with open addressing in C++ with 3-4-5-10 times the performance of `unordered_map`. – The Paramagnetic Croissant Jun 29 '15 at 21:24
  • @TheParamagneticCroissant For almost any of the C++ collections, you can argue that there's almost no problem for which it's the best possible solution. But there are a large number of real world problems for which it's good enough. – David Schwartz Jun 29 '15 at 21:28
  • @DavidSchwartz yes, that's true. I didn't assert the opposite. I often use `unordered_map` myself. I merely asserted that those scenarios are not the cases where performance-aware fine-tuning is/should be relevant. – The Paramagnetic Croissant Jun 29 '15 at 21:29

There are a number of algorithms that require the objects to be hashed into some number of buckets, after which each bucket is processed.

Say, you want to find duplicates in a collection. You hash all items in the collection, then in each bucket you compare items pairwise.

A slightly less trivial example is the Apriori algorithm for finding frequent itemsets.
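
The duplicate-finding idea can be sketched as follows, assuming the items are plain `int`s stored in a `std::unordered_multiset`. The function name `count_duplicate_pairs` is hypothetical. For exact equality `equal_range` would do the same job more directly; the bucket loop is shown only because it mirrors the "hash, then compare pairwise inside each bucket" description.

```cpp
#include <cstddef>
#include <iterator>
#include <unordered_set>
#include <vector>

// Hypothetical helper: counts pairs of equal values, comparing only
// elements that ended up in the same bucket.
std::size_t count_duplicate_pairs(const std::vector<int>& items) {
    std::unordered_multiset<int> table(items.begin(), items.end());
    std::size_t pairs = 0;
    for (std::size_t n = 0; n < table.bucket_count(); ++n) {
        // Pairwise comparison restricted to bucket n.
        for (auto a = table.begin(n); a != table.end(n); ++a) {
            for (auto b = std::next(a); b != table.end(n); ++b) {
                if (*a == *b) ++pairs;
            }
        }
    }
    return pairs;
}
```

Equal values always hash to the same bucket, so no duplicate pair is missed, while values in different buckets are never compared.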

Anton Savin

The only reason I have ever needed the interface is to traverse all the objects in a map without having to hold a lock on the map or copy the map. This can be used for imprecise expiration or other types of periodic checks on objects in the map.

The traversal works as follows:

  1. Lock the map.

  2. Begin traversing the map in bucket order, operating on each object you encounter.

  3. When you decide you've held the lock for too long, stash the key of the object you last operated on.

  4. Wait until you wish to resume operating.

  5. Lock the map, and go to step 2, starting at or near (in bucket order) the key you stopped on. If you reach the end, start back at the beginning.
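
The steps above can be sketched roughly as below. This is one possible reading, not necessarily the code the answer's author uses: the helper `visit_some`, its `budget` parameter, and the choice to work in whole-bucket units are all assumptions made for illustration. If the map rehashes between calls the bucket order changes, so some elements may be skipped or visited twice, which fits the "imprecise" use case described above.

```cpp
#include <cstddef>
#include <mutex>
#include <optional>
#include <unordered_map>

// Hypothetical helper: visits whole buckets until at least `budget` elements
// have been seen, then returns the last key visited so the caller can resume.
// Returns std::nullopt once the end of the map has been reached.
template <class Map, class Fn>
std::optional<typename Map::key_type>
visit_some(Map& map, std::mutex& mtx,
           const std::optional<typename Map::key_type>& resume_after,
           std::size_t budget, Fn visit) {
    std::lock_guard<std::mutex> lock(mtx);             // steps 1 and 5: lock the map
    if (map.bucket_count() == 0) return std::nullopt;  // bucket() requires bucket_count() > 0

    // Resume with the bucket after the one the stashed key maps to.
    // bucket(key) is valid even if that key has been erased in the meantime.
    std::size_t b = resume_after ? map.bucket(*resume_after) + 1 : 0;

    std::optional<typename Map::key_type> last;
    std::size_t visited = 0;
    for (; b < map.bucket_count(); ++b) {              // step 2: traverse in bucket order
        for (auto it = map.begin(b); it != map.end(b); ++it) {
            visit(*it);
            last = it->first;
            ++visited;
        }
        if (last && visited >= budget) return last;    // step 3: stash the key and stop
    }
    return std::nullopt;  // reached the end; the next call starts from the beginning
}
```

A caller would keep the returned key, wait (step 4), and pass it back in on the next call; passing `std::nullopt` starts from the first bucket. The `visit` callback must not insert into or erase from the map in this sketch.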

David Schwartz