TL;DR - In my application many threads grab a ReentrantReadWriteLock in READ mode while they are inserting entries into a ConcurrentHashMap via the compute() method, and release the READ lock once the lambda passed to compute() has finished. A separate thread grabs the ReentrantReadWriteLock in WRITE mode and very (very) quickly releases it. While all this is happening, the ConcurrentHashMap is resizing (growing AND shrinking). I encounter a hang, and I always see ConcurrentHashMap::transfer(), which is called during resizing, in the stack traces. All threads are blocked waiting to acquire MY ReentrantReadWriteLock. Reproducer at: https://github.com/rumpelstiltzkin/jdk_locking_bug
Am I doing something wrong as per documented behavior, or is this a JDK bug? Note that I'm NOT asking for other ways to implement my application.
Details: Here's some context on why my application does what it does. The reproducer code is a pared-down version that demonstrates the problem.
My application has a write-through cache. Entries are inserted into the cache stamped with their insertion time, and a separate flusher-thread iterates the cache looking for entries created after the last time it persisted entries to disk, i.e. after last-flush-time. The cache is nothing but a ConcurrentHashMap.
Now, a race is possible: an entry gets constructed with timestamp tX, and while it is being inserted into the ConcurrentHashMap the flusher-thread iterates the cache and does not find it (the insert is still in flight, so the entry is not yet visible to the flusher-thread's iterator). The flusher-thread therefore does not persist it, and bumps last-flush-time to tY, with tY > tX. On its next pass the flusher-thread will not deem the tX-timestamped entry as needing to be flushed, so it is never persisted. Eventually tX becomes a very old timestamp, the cache evicts the entry, and that update is permanently lost.
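To make the window concrete, here is a deterministic single-threaded simulation of that interleaving (class and variable names such as Entry and lastFlushTime are my own, not from my real code):

```java
import java.util.concurrent.ConcurrentHashMap;

public class LostUpdateDemo {
    static final class Entry {
        final long createdAt;
        Entry(long createdAt) { this.createdAt = createdAt; }
    }

    public static void main(String[] args) {
        ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
        long lastFlushTime = 0;

        // t=1: an updater constructs an entry with timestamp tX = 1,
        // but has not yet made it visible in the map.
        Entry pending = new Entry(1);

        // t=2: the flusher iterates the cache, finds nothing newer than
        // lastFlushTime (= 0), and bumps last-flush-time to tY = 2 (tY > tX).
        long flushed = cache.values().stream()
                .filter(e -> e.createdAt > 0 /* i.e. > lastFlushTime */)
                .count();
        lastFlushTime = 2;

        // t=3: the updater's insert completes; the entry is now visible,
        // but its timestamp (1) is already older than last-flush-time (2).
        cache.put("k", pending);

        // Next flush cycle: the tX-stamped entry is never selected.
        final long cutoff = lastFlushTime;
        long flushedNext = cache.values().stream()
                .filter(e -> e.createdAt > cutoff)
                .count();

        System.out.println("first flush: " + flushed + ", next flush: " + flushedNext);
        // -> first flush: 0, next flush: 0  (the entry is permanently missed)
    }
}
```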
To get around this problem, threads that insert new entries grab a ReentrantReadWriteLock in READ mode inside the lambda that constructs the cache entry within ConcurrentHashMap::compute(), and the flusher-thread grabs the same ReentrantReadWriteLock in WRITE mode when it grabs its last-flush-time. This ensures that whenever the flusher-thread takes a timestamp, every entry already constructed is visible in the Map and has a timestamp <= last-flush-time.
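A minimal sketch of that pattern (names like FlushCoordinationSketch, insert() and flush() are hypothetical; the flush here just counts entries rather than persisting to disk, and main() runs single-threaded, so this sketch itself does not hang - it is the same structure under many concurrent inserters and a resizing map that does):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FlushCoordinationSketch {
    static final class Entry {
        final long createdAt;
        Entry(long createdAt) { this.createdAt = createdAt; }
    }

    static final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
    static final ReentrantReadWriteLock flushLock = new ReentrantReadWriteLock();
    static volatile long lastFlushTime = Long.MIN_VALUE;

    // Updater threads: READ lock is held while the entry is constructed
    // inside compute(), so the flusher cannot take its timestamp while
    // an insert is mid-flight.
    static void insert(String key, long now) {
        cache.compute(key, (k, old) -> {
            flushLock.readLock().lock();   // <-- lock acquired INSIDE the lambda
            try {
                return new Entry(now);
            } finally {
                flushLock.readLock().unlock();
            }
        });
    }

    // Flusher thread: WRITE lock taken (and released immediately) when
    // grabbing last-flush-time, guaranteeing every entry with
    // createdAt <= cutoff is already visible to the iterator.
    static void flush(long now) {
        long cutoff;
        flushLock.writeLock().lock();
        try {
            cutoff = now;
        } finally {
            flushLock.writeLock().unlock();
        }
        final long prev = lastFlushTime;
        final long upTo = cutoff;
        long persisted = cache.values().stream()
                .filter(e -> e.createdAt > prev && e.createdAt <= upTo)
                .count();   // stand-in for persisting to disk
        lastFlushTime = cutoff;
        System.out.println("persisted " + persisted + " entries");
    }

    public static void main(String[] args) {
        insert("a", 1);
        insert("b", 2);
        flush(3);   // prints "persisted 2 entries"
    }
}
```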
Reproduction on my system:
$> java -version
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
$> ./runtest.sh
seed is 1571855560640
Main spawning 100 readers
Main spawned 100 readers
Main spawning a writer
Main spawned a writer
Main waiting for threads ... <== hung
All threads (readers and the writer) blocked waiting for 0x00000000c6511648
$> ps -ef | grep java | grep -v grep
user 54896 54895 0 18:32 pts/1 00:00:07 java -ea -cp target/*:target/lib/* com.hammerspace.jdk.locking.Main
$> jstack -l 54896 > jstack.1
$> grep -B3 'parking to wait for <0x00000000c6511648>' jstack.1 | grep tid | head -10
"WRITER" #109 ...
"READER_99" ...
...
'top' shows my java process has been sleeping for minutes. (It accrues a tiny bit of CPU over time, likely from context switches and the like; see top's man page for why that happens.)
$> top -p 54896
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
54896 user 20 0 4630492 103988 12628 S 0.3 2.7 0:07.37 java -ea -cp target/*:target/lib/* com.hammerspace.jdk.locking.Main