We run Ignite in server mode inside our JVM. Ignite deadlocks in the following scenario. I have added the thread stacks at the end of this question.

a. Create a cache with write-through enabled.
b. In the CacheWriter.write() implementation:
    1. Wait for a second so that step c gets invoked.
    2. Try to read from another cache.
c. While step b is executing, trigger a thread that creates a new cache.
d. In this scenario Ignite deadlocks because:
    1. The cache.put() operation has acquired a read lock.
    2. Creating the cache in a separate thread starts a Partition Map Exchange (PME).
    3. PME tries to acquire all 16 locks, but waits because one read lock is already held.
    4. The cache.get() inside write() cannot complete because it waits for the current Partition Map Exchange to complete.
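
The wait cycle in step d can be modelled without Ignite. The following is a minimal sketch under stated assumptions: a single ReentrantReadWriteLock stands in for the 16-stripe lock, a CompletableFuture stands in for the exchange future, and the class and method names are hypothetical. It is not Ignite's implementation, only an illustration of the cycle.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-alone model; none of these names come from Ignite.
class PmeDeadlockModel {

    /** Returns true if the nested "get" timed out, i.e. the wait cycle formed. */
    static boolean nestedGetTimesOut(long timeoutMs) throws Exception {
        ReentrantReadWriteLock topologyLock = new ReentrantReadWriteLock();
        CompletableFuture<Void> exchangeDone = new CompletableFuture<>();

        // Stand-in for cache.put(): the read lock is held for the whole update.
        topologyLock.readLock().lock();

        // Stand-in for the exchange worker: it needs the write lock before it can finish.
        Thread exchange = new Thread(() -> {
            try {
                topologyLock.writeLock().lockInterruptibly();
                try {
                    exchangeDone.complete(null);
                } finally {
                    topologyLock.writeLock().unlock();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "exchange-model");
        exchange.setDaemon(true);
        exchange.start();

        try {
            // Stand-in for cache.get() inside CacheWriter.write(): it waits for the exchange,
            // which in turn waits for the read lock this thread still holds.
            exchangeDone.get(timeoutMs, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            // A real deadlock would wait forever; the timeout only makes it observable.
            return true;
        } finally {
            // Releasing the read lock lets the exchange stand-in finish.
            topologyLock.readLock().unlock();
            exchange.join(1000);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("nested get timed out: " + nestedGetTimesOut(300));
    }
}
```

In this model the nested wait can only end through the timeout, because the thread that would complete the future is parked behind the read lock held by the waiting thread.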

We have faced this issue in production; the scenario above is just a reproducer. The write-through implementation only tries to read from a cache, and the cache creation happens in a completely different thread.

  1. Why does Ignite block all cache.get() operations for PME when PME does not even hold all the required locks? Shouldn’t the call be blocked only after the PME operation has all the locks?

  2. Why does PME stop everything? If I create cache A, only operations related to cache A or its cache group should be stopped.

  3. Is there any way to resolve this deadlock?

Thread executing cache.put() and write through

"main" #1 prio=5 os_prio=0 tid=0x0000000003505000 nid=0x43f4 waiting on condition [0x000000000334b000]
   java.lang.Thread.State: WAITING (parking)
               at sun.misc.Unsafe.park(Native Method)
               at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
               at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
               at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
               at org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:4870)
               at org.apache.ignite.internal.processors.cache.GridCacheAdapter.repairableGet(GridCacheAdapter.java:4830)
               at org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:1463)
               at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1128)
               at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.get(GatewayProtectedCacheProxy.java:688)
               at ReadWriteThroughInterceptor.write(ReadWriteThroughInterceptor.java:70)
               at org.apache.ignite.internal.processors.cache.GridCacheLoaderWriterStore.write(GridCacheLoaderWriterStore.java:121)
               at org.apache.ignite.internal.processors.cache.store.GridCacheStoreManagerAdapter.put(GridCacheStoreManagerAdapter.java:585)
               at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.update(GridCacheMapEntry.java:6468)
               at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:6239)
               at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5923)
               at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:4041)
               at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.access$5700(BPlusTree.java:3935)
               at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:2039)
               at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1923)
               at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1734)
               at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1717)
               at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:441)
               at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:2327)
               at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateSingle(GridDhtAtomicCache.java:2553)
               at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update(GridDhtAtomicCache.java:2016)
               at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1833)
               at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1692)
               at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicAbstractUpdateFuture.sendSingleRequest(GridNearAtomicAbstractUpdateFuture.java:300)
               at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicSingleUpdateFuture.map(GridNearAtomicSingleUpdateFuture.java:481)
               at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicSingleUpdateFuture.mapOnTopology(GridNearAtomicSingleUpdateFuture.java:441)
               at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicAbstractUpdateFuture.map(GridNearAtomicAbstractUpdateFuture.java:249)
               at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update0(GridDhtAtomicCache.java:1147)
               at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.put0(GridDhtAtomicCache.java:615)
               at org.apache.ignite.internal.processors.cache.GridCacheAdapter.put(GridCacheAdapter.java:2571)
               at org.apache.ignite.internal.processors.cache.GridCacheAdapter.put(GridCacheAdapter.java:2550)
               at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.put(IgniteCacheProxyImpl.java:1337)
               at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.put(GatewayProtectedCacheProxy.java:868)
               at com.eqtechnologic.eqube.cache.tests.readerwriter.WriteReadThroughTest.writeToCache(WriteReadThroughTest.java:54)
               at com.eqtechnologic.eqube.cache.tests.readerwriter.WriteReadThroughTest.lambda$runTest$0(WriteReadThroughTest.java:26)
               at com.eqtechnologic.eqube.cache.tests.readerwriter.WriteReadThroughTest$$Lambda$1095/2028767654.execute(Unknown Source)
               at org.junit.jupiter.api.AssertDoesNotThrow.assertDoesNotThrow(AssertDoesNotThrow.java:50)
               at org.junit.jupiter.api.AssertDoesNotThrow.assertDoesNotThrow(AssertDoesNotThrow.java:37)
               at org.junit.jupiter.api.Assertions.assertDoesNotThrow(Assertions.java:3060)
               at WriteReadThroughTest.runTest(WriteReadThroughTest.java:24)

PME thread waiting for locks

"exchange-worker-#39" #56 prio=5 os_prio=0 tid=0x0000000022b91800 nid=0x450 waiting on condition [0x000000002866e000]
               java.lang.Thread.State: WAITING (parking)
                              at sun.misc.Unsafe.park(Native Method)
                              - parking to wait for  <0x000000076e73b428> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
                              at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
                              at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
                              at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
                              at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
                              at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lockInterruptibly(ReentrantReadWriteLock.java:998)
                              at org.apache.ignite.internal.util.StripedCompositeReadWriteLock$WriteLock.lock0(StripedCompositeReadWriteLock.java:192)
                              at org.apache.ignite.internal.util.StripedCompositeReadWriteLock$WriteLock.lockInterruptibly(StripedCompositeReadWriteLock.java:172)
                              at org.apache.ignite.internal.util.IgniteUtils.writeLock(IgniteUtils.java:10487)
                              at org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.updateTopologyVersion(GridDhtPartitionTopologyImpl.java:272)
                              at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.updateTopologies(GridDhtPartitionsExchangeFuture.java:1269)
                              at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:1028)
                              at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3370)
                              at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3197)
                              at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
                              at java.lang.Thread.run(Thread.java:748)

2 Answers

Technically, you have answered your own question, which is great work, to be honest.

You are not supposed to have blocking methods in your write-through cache store implementation that might get in conflict with PME or cause pool starvation.

You have to remember that PME is a show-stopper mechanism: the entire user load is stopped. In short, that is required to ensure ACID guarantees. The lock indeed is divided into multiple parts to speed up the processing, i.e. allowing up to 16 threads to perform cache operations concurrently. But a PME does need exclusive control over the cluster, thus it acquires a write lock over all the threads.

> Shouldn’t the call be blocked only after PME operation has all the locks?

Yes, that's indeed how it's supposed to work. But in your case, PME tries to get the write lock while a read lock is still held, so it waits for that read lock to be released, and all further read-lock requests are queued behind the pending write lock.
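
The queueing described above can be observed on a plain java.util.concurrent lock. This is a rough sketch, assuming OpenJDK's nonfair ReentrantReadWriteLock (the lock type shown in the exchange-worker stack trace) and modelling a single stripe only; the class and method names are made up for illustration and nothing here is Ignite code.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo of one lock stripe; not Ignite's StripedCompositeReadWriteLock.
class QueuedWriterDemo {

    /** Returns true if a new reader got the read lock while a writer was already queued. */
    static boolean newReaderAcquires(long timeoutMs) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(); // nonfair by default

        // First reader: plays the role of the thread running cache.put().
        lock.readLock().lock();

        // Writer: plays the role of the exchange worker asking for the write lock.
        Thread writer = new Thread(() -> {
            try {
                lock.writeLock().lockInterruptibly();
                lock.writeLock().unlock();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "writer-model");
        writer.setDaemon(true);
        writer.start();

        // Wait until the writer is really queued, then give it a moment to park.
        while (!lock.hasQueuedThread(writer)) {
            Thread.sleep(5);
        }
        Thread.sleep(20);

        // Second reader on another thread: a timed tryLock respects the queued writer.
        boolean[] acquired = new boolean[1];
        Thread reader = new Thread(() -> {
            try {
                if (lock.readLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                    acquired[0] = true;
                    lock.readLock().unlock();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "reader-model");
        reader.start();
        reader.join();

        // Clean up: release the first read lock so the writer can finish.
        lock.readLock().unlock();
        writer.join(1000);
        return acquired[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("new reader acquired the lock: " + newReaderAcquires(200));
    }
}
```

The sketch uses the timed tryLock because the untimed readLock().tryLock() is documented to barge past queued threads, which would hide the effect.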

> Also is there any solution to solve this deadlock?

  • Move cache-related logic out of the CacheStore. Ideally, do not start caches dynamically, since that triggers PME. Have them created in advance if possible.
  • Check whether other mechanisms like continuous queries or entry processors would work.

But still, it all depends on your use case.

Alexandr Shapkin
  • Thanks for your response. I edited the question: the cache creation is not happening from the cache store, only a cache read is. Cache creation happens in a different thread running in parallel – Atul Dhatrak Nov 29 '22 at 13:38
  • The cache.get() operations wait because a topology update is running and they wait for it to complete. This topology update happens inside PME. If the topology update were considered running only once it holds all 16 required locks, cache.get() would not be blocked in this scenario. That would resolve the deadlock we are facing – Atul Dhatrak Nov 29 '22 at 13:41

I don't think creating a cache inside the cache store will work. From the documentation for CacheWriter:

> A CacheWriter is used for write-through to an *external resource*.

(Emphasis mine.)

Without knowing your use case, it's difficult to suggest an alternative approach, but creating your caches in advance or using a continuous query as a trigger works in similar situations.

Stephen Darlington
  • I edited the question. Cache creation is not happening inside the cache store. A thread running in parallel creates the cache while the write-through is executing – Atul Dhatrak Nov 29 '22 at 13:42
  • The write method is called _during_ a write, so even if you push the cache creation to another thread, it's still happening in the middle of a transaction. That's not going to work. As [Alexandr notes](https://stackoverflow.com/a/74601449/2998), dynamically creating caches like this is not a good pattern. – Stephen Darlington Nov 29 '22 at 16:07