
I know the difference between open addressing and chaining for resolving hash collisions. Most of the basic hash-based data structures in Java, such as HashSet and HashMap, use chaining. I read that ThreadLocal actually uses a probing scheme. So I want to understand why open addressing is not used more in Java. I realize that deleting records is harder with that scheme, because deleted cells have to be marked and handled specially. However, it seems that the memory requirement would be lower for open addressing.

Edit: I just want to understand the major reason or reasons for this design decision; I do not need the finer details. I would also like to know why ThreadLocal uses the less common technique of open addressing. I suspect the two answers are related, so I am asking both in the same question.

Geek
  • This question would be best asked to the designers of `HashMap`: Doug Lea, Josh Bloch, Arthur van Hoff and Neal Gafter. I doubt anyone here will be able to tell you what their exact reasoning behind the decision was. – Jeffrey Aug 18 '12 at 14:48
  • @Jeffrey I am just looking for the intuition behind the design decision; I don't want the finer details. It is like asking why Java supports implementing multiple interfaces but only single inheritance. – Geek Aug 18 '12 at 14:57
  • Is there anything in the specification of `java.util.HashMap` that requires implementations to chain instead of double-hashing? – Mike Samuel Aug 18 '12 at 15:07
  • @MikeSamuel It seems to me that chaining is more convenient at the cost of more memory, but I am not sure, which is why I asked this question. I do not know of anything in the HashMap specification that would prevent it from using double hashing. – Geek Aug 18 '12 at 15:12

1 Answer


I am currently discussing memory-compact reimplementations of HashMap and HashSet with, among others, Doug Lea. This particular question hasn't come up, but here are my first thoughts on it:

  • Chained hash tables degrade reasonably gracefully. Whether it's higher load factors or lots of hash collisions, chaining doesn't degrade nearly as quickly as open addressing can.
  • As you've said, remove is...not a pleasant operation on open-addressed tables. As a general rule, remove is the least common operation on hash tables, but there are applications for which it's more common, and bad performance would be noticed.
  • I also suspect -- though I don't have much data -- that implementing a "linked" open-addressed hash table would be noticeably more difficult. LinkedHashMap is written as a subclass of HashMap, and borrows most of the implementation details; it's somewhat easier to implement the linked list of entries when the entries are discrete objects -- and at that point, you're already most of the way to a chained implementation.
  • Nothing in the spec ties them to this implementation -- they're always free to mess around with it later.
  • The JDK collections libraries...don't make memory consumption an especially high priority. Memory is cheap. (You may or may not agree with this, but it's definitely a noticeable trend.)
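To make the point about remove concrete, here is a minimal sketch of one common way to delete from an open-addressed table: leave a "tombstone" marker so that later probes keep walking past the freed slot. The class `ProbingSet` and the `TOMBSTONE` sentinel are hypothetical names for illustration; this is not how `HashMap` or `ThreadLocal` is implemented, and it assumes non-null keys and a fixed capacity with no resizing.

```java
import java.util.Objects;

/** Illustrative linear-probing set of non-null keys (hypothetical, fixed capacity, no resizing). */
final class ProbingSet {
    // Sentinel for a slot whose key was removed; a probe must not stop here.
    private static final Object TOMBSTONE = new Object();
    private final Object[] slots;
    private int size;

    ProbingSet(int capacity) {
        slots = new Object[capacity];
    }

    private int indexFor(Object key) {
        return Math.floorMod(key.hashCode(), slots.length);
    }

    boolean contains(Object key) {
        int i = indexFor(key);
        for (int n = 0; n < slots.length; n++) {
            Object s = slots[i];
            if (s == null) return false;                       // a truly empty slot ends the probe
            if (s != TOMBSTONE && s.equals(key)) return true;
            i = (i + 1) % slots.length;                        // step over other keys and tombstones
        }
        return false;
    }

    boolean add(Object key) {
        Objects.requireNonNull(key);
        if (contains(key)) return false;
        int i = indexFor(key);
        for (int n = 0; n < slots.length; n++) {
            if (slots[i] == null || slots[i] == TOMBSTONE) {   // tombstones can be reused on insert
                slots[i] = key;
                size++;
                return true;
            }
            i = (i + 1) % slots.length;
        }
        throw new IllegalStateException("table full");
    }

    boolean remove(Object key) {
        int i = indexFor(key);
        for (int n = 0; n < slots.length; n++) {
            Object s = slots[i];
            if (s == null) return false;
            if (s != TOMBSTONE && s.equals(key)) {
                slots[i] = TOMBSTONE;                          // nulling the slot would cut the probe chain
                size--;
                return true;
            }
            i = (i + 1) % slots.length;
        }
        return false;
    }

    int size() {
        return size;
    }
}
```

In this sketch tombstones are only reclaimed by a later insert, so a table with many removals keeps paying for long probes on lookups. With chaining, removal is just unlinking one entry from its bucket.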
Louis Wasserman
  • Great to hear about your current work. Can you explain this a little: "Whether it's higher load factors or lots of hash collisions, chaining doesn't degrade nearly as quickly as open addressing can." I know what a load factor is, but I am unable to see how chaining degrades gracefully while open addressing doesn't. – Geek Aug 18 '12 at 16:30
  • Also, do you know, or have any intuition about, why ThreadLocal uses open addressing rather than good old chaining? – Geek Aug 18 '12 at 16:32
  • @Geek: if the load factor is, say, 0.95, then when you search for an absent key in an open-addressed hash table, on average you have to traverse 20 hash table positions before you can be sure that there's no entry associated with that key; for a chained hash table you need to traverse ~1 entry on average. With regard to `ThreadLocal`, I suspect the advantage is that it's easier to "notice," and expunge, GC'd entries -- visiting more or less arbitrary entries is more common in open-addressed implementations, so you'll notice that GC happened. Eh. – Louis Wasserman Aug 18 '12 at 16:57
  • @LouisWasserman: Things are much worse than that. Consider a 1000-slot table in which 500 items map without collisions to the odd numbers, and 100 items map to zero. The load factor is only 0.6, but any not-found item whose hash value is in the range 0 to 199 will have to scan every item from that value up to 199. There will be a one-in-five chance of having a hash value in such a range, and hitting such a hash value will require scanning an average of 100 items. Thus, one ends up having to scan an average of 20 items even with a load factor of only 60%. – supercat Feb 18 '14 at 22:24