We have recently started seeing an issue in our cluster, which runs 2 client instances and 26 Apache Ignite instances, all on AWS r4.2xlarge nodes. When a thread tries to fetch an atomicLong or atomicReference, it blocks and never returns. This usually happens on only 1 or 2 of the Ignite instances. I am not sure why this happens, so any help would be really appreciated.

This is the thread dump while trying to get an atomicReference:

"main" #1 prio=5 os_prio=0 cpu=3528.41ms elapsed=1067.33s allocated=312M defined_classes=9309 tid=0x00007f4ce4046fc0 nid=0x1537 waiting on condition  [0x00007f4cece90000]
   java.lang.Thread.State: WAITING (parking)
                at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
                - parking to wait for  <0x00007f4cbfe7c7d0> (a java.util.concurrent.CountDownLatch$Sync)
                at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.7/AbstractQueuedSynchronizer.java:885)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1039)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1345)
                at java.util.concurrent.CountDownLatch.await(java.base@11.0.7/CountDownLatch.java:232)
                at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7612)
                at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.awaitInitialization(DataStructuresProcessor.java:1147)
                at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.getAtomic(DataStructuresProcessor.java:506)
                at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.atomicReference(DataStructuresProcessor.java:744)
                at org.apache.ignite.internal.IgniteKernal.atomicReference(IgniteKernal.java:3743)
                at org.apache.ignite.internal.IgniteKernal.atomicReference(IgniteKernal.java:3732)
                at company.explore.cache.persist.SavedAudienceLocationProvider.getSavedAudienceLocation(SavedAudienceLocationProvider.java:27)
                at company.explore.listeners.lifecycle.LifecycleListener.configureSavedAudienceLocation(LifecycleListener.java:45)
                at company.explore.listeners.lifecycle.LifecycleListener.onLifecycleEvent(LifecycleListener.java:38)
                at org.apache.ignite.internal.IgniteKernal.notifyLifecycleBeans(IgniteKernal.java:725)
                at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1156)
                at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2038)
                at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1730)
                - locked <0x00007f4cbf072a38> (a org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance)
                at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1158)
                at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1076)
                at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:962)
                at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:861)
                at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:731)
                at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:700)
                at org.apache.ignite.Ignition.start(Ignition.java:348)
                at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:301)

Because the main thread is stuck during node startup, any Ignition.ignite() call on that node blocks as well, so the job never goes through:

"pub-#22" #48 prio=5 os_prio=0 cpu=5.76ms elapsed=1036.50s allocated=421K defined_classes=6 tid=0x00007f4ce4cf3990 nid=0x1607 waiting on condition  [0x00007f40375f6000]
   java.lang.Thread.State: WAITING (parking)
                at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
                - parking to wait for  <0x00007f4cbf16d9e0> (a java.util.concurrent.CountDownLatch$Sync)
                at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.7/AbstractQueuedSynchronizer.java:885)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1039)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1345)
                at java.util.concurrent.CountDownLatch.await(java.base@11.0.7/CountDownLatch.java:232)
                at org.apache.ignite.internal.util.IgniteUtils.awaitQuiet(IgniteUtils.java:7657)
                at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.grid(IgnitionEx.java:1671)
                at org.apache.ignite.internal.IgnitionEx.grid(IgnitionEx.java:1389)
                at org.apache.ignite.internal.IgnitionEx.grid(IgnitionEx.java:1258)
                at org.apache.ignite.Ignition.ignite(Ignition.java:489)
                at company.explore.dataload.person.LoadPersonAttributeJob.call(LoadPersonAttributeJob.java:58)
                at company.explore.dataload.person.LoadPersonAttributeJob.call(LoadPersonAttributeJob.java:31)
                at org.apache.ignite.internal.processors.closure.GridClosureProcessor$C2.execute(GridClosureProcessor.java:1855)
                at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:568)
                at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6817)
                at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:562)
                at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:491)
                at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.7/ThreadPoolExecutor.java:1128)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.7/ThreadPoolExecutor.java:628)
                at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)
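
To make the relationship between the two dumps above easier to reason about, here is a minimal plain-Java sketch of the waiting chain they appear to show. It does not use Ignite at all: the names `initLatch`, `startLatch` and `LatchChainSketch` are hypothetical stand-ins, and the assumption (taken only from the stack frames) is that the startup thread waits on one latch while the job thread waits on a second latch that is released only once startup finishes. Timed waits are used so the sketch terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchChainSketch {
    /**
     * Simulates a startup thread waiting on initLatch and a job thread waiting on
     * startLatch. Returns {startupReleased, jobReleased}.
     */
    static boolean[] simulate(boolean initCompletes, long timeoutMs) throws InterruptedException {
        CountDownLatch initLatch = new CountDownLatch(1);  // hypothetical: "data structures ready"
        CountDownLatch startLatch = new CountDownLatch(1); // hypothetical: "node start finished"
        boolean[] released = new boolean[2];

        // Stands in for "main": blocked inside startup until initLatch is released.
        Thread starter = new Thread(() -> {
            try {
                released[0] = initLatch.await(timeoutMs, TimeUnit.MILLISECONDS);
                if (released[0]) {
                    startLatch.countDown(); // startup completes only after initialization
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "main-sketch");

        // Stands in for "pub-#22": cannot proceed until startup has finished.
        Thread job = new Thread(() -> {
            try {
                released[1] = startLatch.await(2 * timeoutMs, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "pub-sketch");

        starter.start();
        job.start();
        if (initCompletes) {
            initLatch.countDown();
        }
        starter.join();
        job.join();
        return released;
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] stuck = simulate(false, 200);
        System.out.println("init never completes: " + stuck[0] + " " + stuck[1]);
        boolean[] ok = simulate(true, 200);
        System.out.println("init completes:       " + ok[0] + " " + ok[1]);
    }
}
```

If this model is right, the job thread is only a secondary victim: the latch that matters is the first one, and the open question is why it is never counted down on the affected nodes.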

Similarly, here is a thread waiting on a CountDownLatch while trying to get an atomicLong:

"pub-#489" #608 prio=5 os_prio=0 cpu=16.80ms elapsed=7076.10s allocated=2409K defined_classes=17 tid=0x00007f48c8014c60 nid=0x5bd5 waiting on condition  [0x00007f48359e1000]
   java.lang.Thread.State: WAITING (parking)
                at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
                - parking to wait for  <0x00007f518aba6060> (a java.util.concurrent.CountDownLatch$Sync)
                at java.util.concurrent.locks.LockSupport.park(java.base@11.0.7/LockSupport.java:194)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.7/AbstractQueuedSynchronizer.java:885)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1039)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.7/AbstractQueuedSynchronizer.java:1345)
                at java.util.concurrent.CountDownLatch.await(java.base@11.0.7/CountDownLatch.java:232)
                at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:7612)
                at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.awaitInitialization(DataStructuresProcessor.java:1147)
                at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.getAtomic(DataStructuresProcessor.java:506)
                at org.apache.ignite.internal.processors.datastructures.DataStructuresProcessor.atomicLong(DataStructuresProcessor.java:463)
                at org.apache.ignite.internal.IgniteKernal.atomicLong(IgniteKernal.java:3716)
                at org.apache.ignite.internal.IgniteKernal.atomicLong(IgniteKernal.java:3705)
                at company.explore.cache.persist.person.SerializationStatus.getSerializeCounter(SerializationStatus.java:86)
                at company.explore.cache.persist.person.SerializationStatus.startNodeSerialization(SerializationStatus.java:21)
                at company.explore.cache.persist.personv2.PersonSerializationJob.serializePeopleData(PersonSerializationJob.java:98)
                at company.explore.cache.persist.personv2.PersonSerializationJob.run(PersonSerializationJob.java:75)
                at org.apache.ignite.internal.processors.closure.GridClosureProcessor$C4.execute(GridClosureProcessor.java:1944)
                at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:568)
                at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6817)
                at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:562)
                at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:491)
                at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.7/ThreadPoolExecutor.java:1128)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.7/ThreadPoolExecutor.java:628)
                at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)
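
As a stopgap for diagnosis (not a fix), a blocking call like the one above could be run with a time limit so that the job fails fast and can log the problem instead of parking indefinitely. The sketch below uses only the JDK; `TimedCall` and `callWithTimeout` are hypothetical helper names, and whether interrupting the worker actually unblocks the underlying Ignite call is an assumption that would need to be verified.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedCall {
    /**
     * Runs the given call on a separate daemon thread and waits at most the given
     * time for its result. Throws TimeoutException if the call does not return in time.
     */
    static <T> T callWithTimeout(Callable<T> call, long timeout, TimeUnit unit) throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-call");
            t.setDaemon(true); // do not keep the JVM alive if the call never returns
            return t;
        });
        Future<T> future = exec.submit(call);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // requests interruption of the worker thread
            throw e;
        } finally {
            exec.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A call that returns immediately completes normally.
        System.out.println(callWithTimeout(() -> "ok", 1, TimeUnit.SECONDS));
    }
}
```

In our code this might wrap the lookup as `callWithTimeout(() -> ignite.atomicLong(name, 0, true), 30, TimeUnit.SECONDS)`, where `name` and the 30-second limit are placeholders rather than our actual values.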

These issues only started appearing about 2 months ago; the system had been very stable for a long time before that. I haven’t posted the entire thread dump as it would be quite large. If needed, I can post it on pastebin or upload it somewhere.

Since the issue does not occur consistently, I am not sure how to create a reproducer project, but I can provide logs or any other details if needed.

EDIT:

The entire thread dumps have been posted on pastebin. Please find the links below:

Atomic Reference related thread dump: pastebin.com/ydNMFSEP

Atomic Long related thread dump: pastebin.com/psJgwi3F

Paul Jose