
TL;DR - In my application, many threads grab a ReentrantReadWriteLock in READ mode while they are inserting entries into a ConcurrentHashMap via the compute() method, and release the READ lock once the lambda passed to compute() has finished. There is a separate thread that grabs the ReentrantReadWriteLock in WRITE mode and very (very) quickly releases it. While all this is happening, the ConcurrentHashMap is resizing (growing AND shrinking). I encounter a hang, and I always see ConcurrentHashMap::transfer(), which is called during resizing, in the stack traces. All threads are blocked waiting to grab MY ReentrantReadWriteLock. Reproducer at: https://github.com/rumpelstiltzkin/jdk_locking_bug
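In case the link goes stale, here is the shape of the interaction reduced to its essentials. This is my own minimal sketch, not the reproducer's code: the class name, thread setup, key generation and release point are made up, and error handling is elided.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class Repro {
    static final ConcurrentHashMap<Integer, Object> map = new ConcurrentHashMap<>();
    static final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

    public static void main(String[] args) {
        for (int r = 0; r < 100; r++) {
            new Thread(Repro::readerLoop, "READER_" + r).start();
        }
        new Thread(Repro::writerLoop, "WRITER").start();
    }

    // Readers insert under the READ lock; new keys keep arriving, so the map keeps
    // resizing, which is when ConcurrentHashMap::transfer() shows up in stack traces.
    static void readerLoop() {
        for (;;) {
            int key = ThreadLocalRandom.current().nextInt(1 << 22);
            map.compute(key, (k, v) -> {
                rwLock.readLock().lock(); // taken inside the lambda, i.e. while the map
                return new Object();      // is holding its own internal bin lock
            });
            rwLock.readLock().unlock();   // released once the lambda/compute() has finished
        }
    }

    // The single writer grabs the WRITE lock and releases it almost immediately, in a loop.
    static void writerLoop() {
        for (;;) {
            rwLock.writeLock().lock();
            rwLock.writeLock().unlock();
        }
    }
}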

Am I doing something wrong as per documented behavior, or is this a JDK bug? Note that I'm NOT asking for other ways to implement my application.


Details: Here's some context around why my application is doing what it's doing. The reproducer code is a pared down version to demonstrate the problem.

My application has a write-through cache. Entries are inserted into the cache with a timestamp recording when they were inserted, and a separate flusher-thread iterates the cache to find entries that were created after the last time the flusher-thread persisted entries to disk, i.e. after last-flush-time. The cache is nothing but a ConcurrentHashMap.

Now, a race is possible whereby an entry gets constructed with a timestamp tX, and while it is being inserted into the ConcurrentHashMap the flusher-thread iterates the cache and does not find the entry (it is still being inserted, so not yet visible in the flusher-thread's Map::Iterator); it therefore does not persist it and bumps last-flush-time to tY such that tY > tX. The next time the flusher-thread iterates the cache it will not deem the tX-timestamped entry as needing to be flushed, and we will miss persisting it. Eventually tX will be a very old timestamp and the cache will evict the entry, permanently losing that update.

To get around this problem, threads that update the cache with new entries grab a ReentrantReadWriteLock in READ mode within the lambda that constructs the cache entry inside the ConcurrentHashMap::compute() method, and the flusher-thread grabs the same ReentrantReadWriteLock in WRITE mode when grabbing its last-flush-time. This ensures that when the flusher-thread grabs a timestamp, all objects are "visible" in the Map and have a timestamp <= the last-flush-time.
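To make the protocol concrete, here is a sketch of the scheme just described, under my reading of it. WriteThroughCache, Entry, persist() and the field names are mine, not the reproducer's, and error handling is elided.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class WriteThroughCache {
    static final class Entry {
        final String key;
        final long createdAt;
        Entry(String key, long createdAt) { this.key = key; this.createdAt = createdAt; }
    }

    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    private volatile long lastFlushTime = Long.MIN_VALUE;

    // Updater threads: the creation timestamp is assigned under the READ lock, inside
    // compute()'s lambda; the lock is released only after compute() has returned, i.e.
    // after the new entry is in the map.
    void update(String key) {
        cache.compute(key, (k, old) -> {
            rwLock.readLock().lock();
            return new Entry(k, System.currentTimeMillis());
        });
        rwLock.readLock().unlock();
    }

    // Flusher thread: while the WRITE lock is held no update() can be in flight, so every
    // entry stamped at or before 'cutoff' is already visible to the iteration below.
    void flush() {
        long cutoff;
        rwLock.writeLock().lock();
        try {
            cutoff = System.currentTimeMillis();
        } finally {
            rwLock.writeLock().unlock();
        }
        for (Entry e : cache.values()) {
            if (e.createdAt > lastFlushTime) {
                persist(e);
            }
        }
        lastFlushTime = cutoff; // nothing stamped <= cutoff can have been missed
    }

    private void persist(Entry e) {
        // stand-in for the real write-through-to-disk path
    }
}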


Reproduction on my system:

$> java -version
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
$> ./runtest.sh 
seed is 1571855560640
Main spawning 100 readers
Main spawned 100 readers
Main spawning a writer
Main spawned a writer
Main waiting for threads ... <== hung

All threads (readers and the writer) blocked waiting for 0x00000000c6511648

$> ps -ef | grep java | grep -v grep
user   54896  54895  0 18:32 pts/1    00:00:07 java -ea -cp target/*:target/lib/* com.hammerspace.jdk.locking.Main

$> jstack -l 54896 > jstack.1

$> grep -B3 'parking to wait for  <0x00000000c6511648>' jstack.1  | grep tid | head -10
"WRITER" #109 ...
"READER_99" ...
...

'top' shows my java process has been sleeping for minutes (it accrues a tiny bit of CPU, possibly from context switching and the like; see top's man page for more explanation of why this happens)

$> top -p 54896
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                               
 54896 user      20   0 4630492 103988  12628 S   0.3  2.7   0:07.37 java -ea -cp target/*:target/lib/* com.hammerspace.jdk.locking.Main

1 Answer


Note: Summarized below are my observations, a suggested approach, and advice to file a bug with Oracle. This is not a solution.

Observations

  1. ConcurrentHashMap has its own built-in locking mechanism; we do not need to acquire one ourselves.

  2. Atomic* classes perform their updates with single lock-free hardware operations, so there is no need to acquire a lock when dealing with them.

  3. In Cache.java, you are obtaining (your own) ReadLock for updating the cache (line 34) and (your own) WriteLock for reading from the map (line 58), and you are not acquiring any lock when you actually remove the mapping (line 71).

  4. ConcurrentHashMap's iterators are weakly consistent, so they may not see your updates even after the insertion has completed; this is by design (see the illustration after these observations).

  5. I've restored AtomicInteger, as I did not want to use Holder (coming from jax-ws), and I was not able to reproduce your thread blocking.

  6. Given that you start the ReadLock-acquiring threads before starting the WriteLock-acquiring threads, the WriteLock-acquiring threads will never get a chance to run, as there is already a bunch of threads that have acquired read locks.

  7. Introducing a 1-second sleep in the Cache#update method after releasing the ReadLock gave the WriteLock-acquiring threads a chance to run.

  8. I've reverted my updates and am able to reproduce your issue. But I did see a pattern.

    a. Using Holder for lockCount brought the system to a crawl almost immediately.

    b. Using AtomicInteger for lockCount kept it alive for a few more seconds.

    c. Introducing a console statement that logged lock acquisition and release along with the runnable's id kept it alive for a minute or two.

    d. Replacing the id with the current thread's name in the console output made the problem go away completely.

With this, it clearly looks like a timing issue: there is a race between acquiring the read lock and the write lock, with both readers and the writer waiting to acquire their respective locks and ending up deadlocked, and the delay introduced by the extra statements reduces the chances of this happening.
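For concreteness, here is a small, self-contained illustration of observations 1 and 4 (my own example with made-up keys, not code from the question): compute() already makes the per-key update atomic without any external lock, and an iterator created before a concurrent insert is only weakly consistent, so it may or may not reflect that insert.

import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ChmSemantics {
    public static void main(String[] args) {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();

        // Observation 1: compute() runs the remapping function atomically for the key,
        // so no external lock is needed for the update itself.
        map.compute("a", (k, v) -> v == null ? 1 : v + 1);

        // Observation 4: iterators are weakly consistent. An iterator created now ...
        Iterator<Map.Entry<String, Integer>> it = map.entrySet().iterator();

        // ... will never throw ConcurrentModificationException, but it is not guaranteed
        // to reflect an entry inserted after its creation.
        map.put("b", 2);

        while (it.hasNext()) {
            System.out.println(it.next()); // "a=1" for sure; "b=2" only maybe
        }
    }
}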

Suggested Approach

  1. Given that ConcurrentHashMap comes with its own locking mechanism, you can stop using your own ReentrantReadWriteLock while dealing with it.

  2. Update your code to give the WriteLock acquirers a chance to run :) (one option is sketched just after this list)

  3. Check your version of Java, as I never got to a blocked state while running on Java 1.8.0_201
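As an aside on point 2: if the external lock is kept at all, one possible tweak (my illustration, not something stated above or taken from the question's repo) is to construct the lock in fair mode, so a queued writer is not overtaken indefinitely by a steady stream of new readers.

import java.util.concurrent.locks.ReentrantReadWriteLock;

// fair = true: waiting threads acquire roughly in arrival order, so a queued
// write-lock request is not starved by later read-lock requests.
private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock(true);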

  • Thanks for your comment, Elisha. I'm not looking for a different implementation of my approach; I'm looking to find out why the reproducer hangs. #1 - definitely, but I suspect you haven't understood why I am using my own locks (it is not to guarantee atomicity of insertion into the cache). Ack on #2 - again, the lock is not for the atomic; let me remove it. #3 is a correct observation. Ack on observation #4. I don't mind not seeing the update in a given iteration as long as the next iteration will let me see it, and as long as that entry's timestamp is after my writer's timestamp. – Anand Ganesh Oct 23 '19 at 02:39
  • I've force-pushed a change which removes the AtomicInteger to count the number of times the lambda was applied. Now it's a simple Holder (which I must use because I'm updating the value within a lambda). Btw, you can remove the 3rd commit which introduces the counter (to test the theory that the ConcurrentHashMap implementation might be applying my lambda more than once) and still easily reproduce the hang. – Anand Ganesh Oct 23 '19 at 02:57
  • I've cloned your repo and ran it *without* any blocking of Threads. It's strange that you are seeing this. I am running on Java 1.8.0_201 64-bit. One important observation though is that Write Lock was *never* acquired, and this is expected as one cannot acquire a Write Lock while there are any acquired Read Locks. – Elisha Ebenezer Oct 23 '19 at 11:27
  • The java version could be the difference in our experience. – Anand Ganesh Oct 23 '19 at 18:46
  • I've updated my question with output of commands reproducing the problem. I'll try to reproduce it on version 1.8.0_201 but it will help if you run my program without any modification and show me what you get with those same commands. If your threads are all sleeping and there's no activity at all, then it means they are hung and hitting the issue I'm reporting. Thanks! – Anand Ganesh Oct 23 '19 at 19:00
  • Nice experiments, Elisha. Thanks for trying them and posting your results. It is very much possible that (d) isn't completely taking away the problem, rather prolonging it longer than you waited. I'm also glad I don't have to repro with 1.8.0_201. So, given all of this, any ideas as to what's causing the deadlock? :-) Notice that in the "jstack -l" output, which shows all threads permanently waiting/hung on the ReentrantReadWriteLock, there is _no_ thread which says it _holds_ that lock. – Anand Ganesh Oct 24 '19 at 05:02
  • I waited for almost 2 hours with 8(d) and the issue did not occur. You may want to try this with the latest 1.8 build, and file a bug with Oracle. – Elisha Ebenezer Oct 24 '19 at 06:23
  • Elisha, do you mind updating your answer to state that you are not providing any answers? I appreciate the observations and the reproduction attempts you did, but your suggested approach is misleading and is not the answer to my question. I want to make sure people who visit my question don't think that it has been answered and move on without looking at it, especially because it's a long read to provide the required details. Thanks. – Anand Ganesh Oct 29 '19 at 21:09