
Since language standards rarely mandate implementation methods, I'd like to know which hashing method the real-world C++ standard library implementations (libc++, libstdc++ and Dinkumware) actually use for their unordered containers.

In case it's not clear, I expect the answer to name a method such as one of these:

  • Hashing with chaining
  • Hashing by Division / Multiplication
  • Universal hashing
  • Perfect hashing (static, dynamic)
  • Hashing with open addressing (linear/quadratic probing or double hashing)
  • Robin-Hood hashing
  • Bloom Filters
  • Cuckoo hashing

Knowing why a particular method was chosen over the others would be helpful as well.

Nikos Athanasiou
  • @JoachimPileborg "the data is organized in buckets" means chaining (or at least that we have to exclude open addressing), right? – Nikos Athanasiou Jul 07 '15 at 11:08
  • See [N1456](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n1456.html), III "Design Decisions" "B. Chaining Versus Open Addressing" – dyp Jul 07 '15 at 11:21
  • You should not care... Why do you ask? If you care, look into the source code of your implementation. – Basile Starynkevitch Jul 13 '15 at 14:06
  • @BasileStarynkevitch First, I want to know whether any algorithmic approach fits a specific language better. Second, I hope to get clues about the maturity of the methods by learning about real-world implementations. I do look into my implementation, but I have limited time; anyone who knows these things (even if they reflect the current state and are subject to change) could help – Nikos Athanasiou Jul 13 '15 at 14:11
  • This question is way, way too broad. Not only will your question attract (low quality) answers for every future implementation ever to be produced, you widened the field some more by adding in more language tags. Oh, and you'd like to hear the motivation for each method too. Please narrow this down *severely*; I've removed the extra language tags. – Martijn Pieters Jul 13 '15 at 17:34
  • @MartijnPieters Do you consider the existing answer to be of low quality ? – Nikos Athanasiou Jul 13 '15 at 17:55

1 Answer

  • libstdc++: chaining, only power-of-two table sizes, default load threshold for rehashing is 1.0 (if it is configurable at all), buckets are all separate allocations. This may be outdated; I don't know the current state of things.
  • Rust: Robin Hood hashing, default load threshold for rehashing is 0.9 (too high for open addressing, BTW)
  • Go: table slots point to "bins" of 5 (7?) slots; I'm not sure what happens when a bin is full. AFAIR it grows in a vector/ArrayList manner.
  • Java: chaining, only power-of-two table size, default load threshold is 0.75 (configurable), buckets (called entries) are all separate allocations. In recent versions of Java, above a certain threshold, chains are changed to binary search trees.
  • C#: chaining, buckets are allocated from a flat array of bucket structures. If this array is full, it is rehashed (with the table, I suppose) in a vector/ArrayList manner.
  • Python: open addressing with its own unique collision-resolution scheme (not a very fortunate one, IMHO), only power-of-two table sizes, load threshold for rehashing is 0.666... (good). However, slot data is kept in a separate array of structures (as in C#), i.e. hash table operations touch at least two different random memory locations (one in the table and one in the array of slot data).

If a description omits some detail, that doesn't mean the feature is absent; it means I don't know or don't remember it.
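To make the terms used in the list concrete (chaining, power-of-two table size, load threshold for rehashing), here is a minimal sketch of a chained hash set in C++. It is an illustration only and is not taken from libstdc++ or any other library: the name `ChainedIntSet`, the initial size of 8, the doubling policy and the threshold of 1.0 are all assumptions chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical illustration type; not the code of any standard library.
class ChainedIntSet {
public:
    ChainedIntSet() : buckets_(8) {}  // assumed initial size, a power of two

    // Returns false if the key was already present.
    bool insert(int key) {
        if (contains(key)) return false;
        // Rehash before the load factor (size / bucket count) would exceed 1.0.
        if (size_ + 1 > buckets_.size()) grow();
        buckets_[index_of(key, buckets_.size())].push_front(key);
        ++size_;
        return true;
    }

    bool contains(int key) const {
        // Walk the chain of the one bucket this key can live in.
        for (int k : buckets_[index_of(key, buckets_.size())])
            if (k == key) return true;
        return false;
    }

    std::size_t size() const { return size_; }
    std::size_t bucket_count() const { return buckets_.size(); }

private:
    static std::size_t index_of(int key, std::size_t n) {
        assert(n != 0 && (n & (n - 1)) == 0);    // n must be a power of two
        return std::hash<int>{}(key) & (n - 1);  // bit mask instead of modulo
    }

    // Double the table and redistribute every key into the new buckets.
    void grow() {
        std::vector<std::forward_list<int>> bigger(buckets_.size() * 2);
        for (const auto& chain : buckets_)
            for (int k : chain)
                bigger[index_of(k, bigger.size())].push_front(k);
        buckets_ = std::move(bigger);
    }

    // Each list node is its own heap allocation, as in node-based chaining.
    std::vector<std::forward_list<int>> buckets_;
    std::size_t size_ = 0;
};
```

With a power-of-two bucket count the bucket index is a single bit mask, which is cheap but relies on the hash function mixing its low bits well; a prime bucket count with modulo is the usual alternative.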

Louis Wasserman
leventov
  • +1 Could you provide some links showing where you got the info? This answers even more than I asked for; if you are solid on what you say, I could retag my question to include the languages you mention (seriously, how do you know all this?) – Nikos Athanasiou Jul 07 '15 at 14:18
  • Python/C#: there are very elaborate articles/talks explaining it; I did not look at the source. C++/Java/Rust/Go: I looked at the source. Get the sources and dig. It was hard sometimes, and I'm not going to repeat it just to provide links. I'm sure about everything except Go. – leventov Jul 07 '15 at 14:22
  • After inserting 10 items into a libstdc++ `unordered_set`, the container reports having a table size of 11: http://melpon.org/wandbox/permlink/l0YoyCgQNLSiyQox 11 is not a power of two. – Howard Hinnant Jul 07 '15 at 14:32
  • @HowardHinnant indeed. I looked into the libstdc++ unordered_map implementation a couple of years ago. This means either that they changed the algorithm, or that unordered_set uses a different algorithm than unordered_map. The latter seems sensible, because a separate allocation that holds just a key is more wasteful than one holding key + value, and keys tend to be small while values are larger. – leventov Jul 07 '15 at 14:39
  • @HowardHinnant same for unordered_map, now. – leventov Jul 07 '15 at 14:42
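The measurement discussed in these comments can be repeated on any implementation through the standard bucket interface of the unordered containers. The following is a minimal sketch; the helper names `is_power_of_two`, `bucket_count_after` and `report` are made up for this example, and the numbers printed depend on the standard library in use, so no expected output is given.

```cpp
#include <cstddef>
#include <iostream>
#include <unordered_set>

// True if n is a power of two (hypothetical helper for this example).
bool is_power_of_two(std::size_t n) { return n != 0 && (n & (n - 1)) == 0; }

// Bucket count reported by a std::unordered_set<int> after inserting 0..n-1.
std::size_t bucket_count_after(int n) {
    std::unordered_set<int> s;
    for (int i = 0; i < n; ++i) s.insert(i);
    return s.bucket_count();
}

// Prints what the container reports about its table after n insertions.
void report(std::ostream& os, int n) {
    std::unordered_set<int> s;
    for (int i = 0; i < n; ++i) s.insert(i);
    os << "size:            " << s.size() << '\n'
       << "bucket_count:    " << s.bucket_count() << '\n'
       << "load_factor:     " << s.load_factor() << '\n'
       << "max_load_factor: " << s.max_load_factor() << '\n'
       << "power of two:    " << std::boolalpha
       << is_power_of_two(s.bucket_count()) << '\n';
}
```

Calling `report(std::cout, 10)` from `main` prints the table size after ten insertions, which is the quantity reported as 11 for libstdc++ in the comment above. The existence of `bucket_count()`, `bucket_size()` and per-bucket iterators in the standard interface is also what makes chaining the natural fit for these containers.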