
PS: Some people on SO dislike discussing the motivation and trade-offs behind JDK implementation details, on the grounds that JDK engineers are free to make such choices without explaining them (a previous post about JDK motivation was closed). This question is purely about the HashMap algorithm and data structure, and about the trade-offs and engineering considerations between two separate-chaining implementations.

As we all know, hash collisions in a HashMap can be handled by separate chaining, where each bucket holds its own linked list. In principle, when a new element collides with existing ones, it can be inserted at either the head or the tail of that list.

Both methods have the same worst-case time complexity: in either case we have to scan the whole linked list to check whether the key already exists, and insert only if it does not. Once the scan is finished, we hold both the head and the tail. However, when I took an algorithms course, my teacher told us that insertion at the head is preferred because, in general, more recently inserted elements are more likely to be looked up. Consistent with that, every algorithms or data structures textbook I have seen, whether it uses pseudo-code or a concrete programming language, inserts at the head (e.g., Algorithms by Sedgewick; Introduction to Algorithms (CLRS), page 258).
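To make the two options concrete, here is a minimal sketch of a separate-chaining map that supports either policy. It is not the JDK's code: `ChainedMap`, `insertAtHead`, `indexFor`, and `chainOf` are hypothetical names chosen for illustration, and the table has a fixed size with no resizing or treeification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical teaching class, NOT the JDK's HashMap.
class ChainedMap {
    static final class Node {
        final String key;
        int value;
        Node next;

        Node(String key, int value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node[] table = new Node[4]; // fixed size, no resizing (assumption)
    private final boolean insertAtHead;

    ChainedMap(boolean insertAtHead) {
        this.insertAtHead = insertAtHead;
    }

    private int indexFor(String key) {
        return Math.floorMod(key.hashCode(), table.length);
    }

    void put(String key, int value) {
        int i = indexFor(key);
        Node last = null;
        // Either policy must walk the whole chain to rule out a duplicate key.
        for (Node n = table[i]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                n.value = value; // key already present: update in place
                return;
            }
            last = n;
        }
        if (insertAtHead || last == null) {
            table[i] = new Node(key, value, table[i]); // head insertion
        } else {
            last.next = new Node(key, value, null);    // tail insertion
        }
    }

    Integer get(String key) {
        for (Node n = table[indexFor(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                return n.value;
            }
        }
        return null;
    }

    // Keys of the chain that `key` hashes to, in traversal order.
    List<String> chainOf(String key) {
        List<String> keys = new ArrayList<>();
        for (Node n = table[indexFor(key)]; n != null; n = n.next) {
            keys.add(n.key);
        }
        return keys;
    }
}
```

With keys that land in the same bucket, both variants return the same values from `get`; only the order in which the chain is traversed differs.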

However, a few days ago I read the HashMap source code in JDK 8. It inserts at the tail, which is not what I expected (see lines 611 and 641 and the putVal() method in the JDK 8 source). I then checked JDK 7 and found that it inserts at the head, as we usually learn (see lines 402 and 766 and the addEntry() method in the JDK 7 source).

My question:

In general, what is the trade-off between inserting at the head and inserting at the tail when implementing a separate-chaining HashMap? Are there any practical engineering considerations (e.g., multi-threading)? (I have seen several blog posts saying that head insertion may cause an infinite loop if the map is not properly synchronized.)

maplemaple
  • Have you read the summary of the blog you shared? The reason is clearly mentioned there. – Papai from BEKOAIL Jul 05 '21 at 08:53
  • There is no atomic operation in the linking, so it will fail regardless of the insertion point. If you need multithreaded access, you should use ConcurrentHashMap. – Surt Jul 05 '21 at 10:00
  • @PapaifromBEKOAIL That is an incorrect use of HashMap, not the reason for the change to tail insertion. Even after the change to tail insertion, HashMap is still not thread-safe. – Poison Dec 23 '21 at 11:51

1 Answer


in general, more recently inserted elements have more chances to be looked up

That may be true under additional assumptions. For example, in a large, long-lived structure, possibly stored on disk, older elements may become outdated and no longer be looked up.

Here we are talking about an in-memory structure that is typically short-lived: it is used for some computation and then discarded. Without an assumption relating insertion order to access frequency, there is no reason to expect more recent elements to be accessed more often.
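A rough way to see this is to count key comparisons within a single chain. The sketch below models one bucket with a plain `java.util.LinkedList`; `ProbeCount`, `probes`, and `buildChain` are hypothetical names, and the four keys are simply assumed to collide in one bucket.

```java
import java.util.LinkedList;

class ProbeCount {
    // Number of key comparisons a linear scan needs to find `key` in one chain
    // (or the full chain length if the key is absent).
    static int probes(LinkedList<String> chain, String key) {
        int idx = chain.indexOf(key);
        return idx < 0 ? chain.size() : idx + 1;
    }

    // Builds one chain from keys given in insertion order.
    static LinkedList<String> buildChain(String[] keysInInsertionOrder, boolean insertAtHead) {
        LinkedList<String> chain = new LinkedList<>();
        for (String k : keysInInsertionOrder) {
            if (insertAtHead) {
                chain.addFirst(k);
            } else {
                chain.addLast(k);
            }
        }
        return chain;
    }

    public static void main(String[] args) {
        String[] keys = {"k1", "k2", "k3", "k4"}; // assumed to collide in one bucket
        LinkedList<String> head = buildChain(keys, true);
        LinkedList<String> tail = buildChain(keys, false);
        System.out.println("newest key: head=" + probes(head, "k4") + ", tail=" + probes(tail, "k4"));
        System.out.println("oldest key: head=" + probes(head, "k1") + ", tail=" + probes(tail, "k1"));
    }
}
```

Head insertion finds the newest key in one comparison and the oldest in four, and tail insertion does the reverse. If every key is looked up equally often, the total number of comparisons is the same for both, so the choice only matters when access frequency is correlated with insertion order.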

Also, where a value is inserted has nothing to do with concurrency. If concurrent access is needed, a thread-safe structure such as ConcurrentHashMap should be used, and either insertion point would work.

That said, the JDK can implement it either way. I think the authors chose whichever was more convenient and gave clearer code. My guess is that JDK 7 inserts at the head because that avoids having to check whether the bucket for the given hash already holds a value. JDK 8 changed the implementation significantly. There, at the point where a new node is inserted, we have just reached the last node in the list, so it may have seemed more natural to the author to write

if ((e = p.next) == null) {
    p.next = newNode(hash, key, value, null);

than

if ((e = p.next) == null) {
    tab[i] = newNode(hash, key, value, tab[i]);

but both ways would work fine.

Prectron