
In the production environment, one node (172.11.11.36) logs the following:

[..common.ignite.spi.CustomTcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/172.11.11.49:53137, rmtPort=53137
[2021-12-14T15:25:21,681][ERROR][sys-stripe-15-#16][org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi] Failed to send message to remote node [node=TcpDiscoveryNode [id=f6fe6cd0-612b-4a26-8b63-2054b749fe7f, consistentId=node-live-39, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.11.11.39], sockAddrs=HashSet [ip-172-11-11-39.ap-northeast-1.compute.internal/172.11.11.39:47500, ip-172-17-0-1.ap-northeast-1.compute.internal/172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=5, intOrder=5, lastExchangeTime=1638264334577, loc=false, ver=2.9.1#20201203-sha1:adce517, isClient=false], msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=1, arr=[107382444]]]]]
org.apache.ignite.internal.cluster.ClusterTopologyCheckedException: Remote node does not observe current node in topology : f6fe6cd0-612b-4a26-8b63-2054b749fe7f
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3819) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3635) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createCommunicationClient(TcpCommunicationSpi.java:3375) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:3180) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:3013) [ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2960) [ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:2100) [ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.managers.communication.GridIoManager.sendToGridTopic(GridIoManager.java:2195) [ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:1257) [ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:1296) [ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.sendDeferredUpdateResponse(GridDhtAtomicCache.java:3643) [ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$3300(GridDhtAtomicCache.java:141) [ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$DeferredUpdateTimeout.run(GridDhtAtomicCache.java:3889) [ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:565) [ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.9.1.jar:2.9.1]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]

Node 172.11.11.39 logs:

[2021-12-14T15:25:21,641][WARN ][disco-event-worker-#71][org.apache.ignite.internal.managers.discovery.GridDiscoveryManager] Node FAILED: TcpDiscoveryNode [id=702a3e0f-afc9-446e-9c9d-7ec25b185b49, consistentId=node-live-36, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.11.11.36], sockAddrs=HashSet [ip-172-11-11-36.ap-northeast-1.compute.internal/172.11.11.36:47500, ip-172-17-0-1.ap-northeast-1.compute.internal/172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1638264334663, loc=false, ver=2.9.1#20201203-sha1:adcce517, isClient=false]
[2021-12-14T15:25:21,680][WARN ][grid-nio-worker-tcp-comm-6-#45][org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi] Close incoming connection, unknown node [nodeId=702a3e0f-afc9-446e-9c9d-7ec25b185b49, ses=GridSelectorNioSessionImpl [worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=6, bytesRcvd=437336562986, bytesSent=474752492909, bytesRcvd0=1781892, bytesSent0=1106881, select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-6, igniteInstanceName=null, finished=false, heartbeatTs=1639495521670, hashCode=1976943565, interrupted=false, runner=grid-nio-worker-tcp-comm-6-#45]]], writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=38 lim=38 cap=32768], inRecovery=null, outRecovery=null, closeSocket=true, outboundMessagesQueueSizeMetric=o.a.i.i.processors.metric.impl.LongAdderMetric@69a257d1, super=GridNioSessionImpl [locAddr=/172.11.11.39:47100, rmtAddr=/172.11.11.36:49818, createTime=1639495521670, closeTime=0, bytesSent=18, bytesRcvd=42, bytesSent0=18, bytesRcvd0=42, sndSchedTime=1639495521670, lastSndTime=1639495521670, lastRcvTime=1639495521670, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=o.a.i.i.util.nio.GridDirectParser@4f37de39, directMode=true], GridConnectionBytesVerifyFilter], accepted=true, markedForClose=false]]]
[2021-12-14T15:25:21,673][ERROR][query-#105][org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi] Failed to send message to remote node [node=TcpDiscoveryNode [id=702a3e0f-afc9-446e-9c9d-7ec25b185b49, consistentId=node-live-36, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.11.11.36], sockAddrs=HashSet [ip-172-11-11-36.ap-northeast-1.compute.internal/172.11.11.36:47500, ip-172-17-0-1.ap-northeast-1.compute.internal/172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1638264334663, loc=false, ver=2.9.1#20201203-sha1:adcce517, isClient=false], msg=GridIoMessage [plc=10, topic=TOPIC_QUERY, topicOrd=19, ordered=false, timeout=0, skipOnTimeout=false, msg=GridQueryNextPageResponse [qryReqId=78777738, segmentId=0, qry=2, page=0, allRows=364, cols=4, retry=null, retryCause=null, last=true, removeMapping=false, valsSize=1456, rowsSize=0]]]
org.apache.ignite.internal.cluster.ClusterTopologyCheckedException: Failed to send message (node left topology): TcpDiscoveryNode [id=702a3e0f-afc9-446e-9c9d-7ec25b185b49, consistentId=node-live-36, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.11.11.36], sockAddrs=HashSet [ip-172-11-11-36.ap-northeast-1.compute.internal/172.11.11.36:47500, ip-172-17-0-1.ap-northeast-1.compute.internal/172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1638264334663, loc=false, ver=2.9.1#20201203-sha1:adcce517, isClient=false]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioSession(TcpCommunicationSpi.java:3736) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3635) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createCommunicationClient(TcpCommunicationSpi.java:3375) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:3180) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:3013) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2960) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:2100) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.managers.communication.GridIoManager.sendToGridTopic(GridIoManager.java:2195) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:1257) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.map(GridDhtLockFuture.java:1026) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.onOwnerChanged(GridDhtLockFuture.java:714) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.GridCacheMvccManager.notifyOwnerChanged(GridCacheMvccManager.java:227) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.GridCacheMvccManager.access$200(GridCacheMvccManager.java:82) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.GridCacheMvccManager$3.onOwnerChanged(GridCacheMvccManager.java:164) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.checkOwnerChanged(GridCacheMapEntry.java:4935) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.checkOwnerChanged(GridCacheMapEntry.java:4887) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.distributed.GridDistributedCacheEntry.readyLock(GridDistributedCacheEntry.java:516) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.readyLocks(GridDhtLockFuture.java:622) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLockFuture.map(GridDhtLockFuture.java:830) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTransactionalCacheAdapter.lockAllAsync(GridDhtTransactionalCacheAdapter.java:1274) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTransactionalCacheAdapter.processNearLockRequest0(GridDhtTransactionalCacheAdapter.java:815) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTransactionalCacheAdapter.processNearLockRequest(GridDhtTransactionalCacheAdapter.java:800) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTransactionalCacheAdapter.access$000(GridDhtTransactionalCacheAdapter.java:112) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTransactionalCacheAdapter$3.apply(GridDhtTransactionalCacheAdapter.java:158) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTransactionalCacheAdapter$3.apply(GridDhtTransactionalCacheAdapter.java:156) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1142) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:109) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:308) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1907) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1528) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$5300(GridIoManager.java:241) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1421) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:565) ~[ignite-core-2.9.1.jar:2.9.1]
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) ~[ignite-core-2.9.1.jar:2.9.1]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]

Then node 172.11.11.36 logs:

[..common.ignite.spi.CustomTcpDiscoverySpi] Initialized connection with remote server node [nodeId=ad21f9e2-cfd0-44b2-821f-7be19184b3d8, rmtAddr=/172.11.11.21:59943]
[2021-12-14T15:25:23,215][WARN ][tcp-disco-msg-worker-[ad21f9e2 172.11.11.37:47500 crd]-#2-#67][..common.ignite.spi.CustomTcpDiscoverySpi] Node is out of topology (probably, due to short-time network problems).
[2021-12-14T15:25:23,216][WARN ][disco-event-worker-#69][org.apache.ignite.internal.managers.discovery.GridDiscoveryManager] Local node SEGMENTED: TcpDiscoveryNode [id=702a3e0f-afc9-446e-9c9d-7ec25b185b49, consistentId=node-live-36, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, 172.17.0.1, 172.11.11.36], sockAddrs=HashSet [ip-172-11-11-36.ap-northeast-1.compute.internal/172.11.11.36:47500, ip-172-17-0-1.ap-northeast-1.compute.internal/172.17.0.1:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1639495523214, loc=true, ver=2.9.1#20201203-sha1:adcce517, isClient=false]
[2021-12-14T15:25:23,228][ERROR][disco-event-worker-#69][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SEGMENTATION, err=null]]

After that, node 36 logs:

[2021-12-14T15:25:23,240][ERROR][node-stopper][] Stopping local node on Ignite failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]

and the node shut down completely.

At that time, I checked the logs and confirmed that the network was working well: this node could connect to the other servers, the other servers could connect to it and exchange partition data, and client nodes could connect to it to execute query tasks.

But I don't understand why the other server nodes show the same error (Close incoming connection, unknown node) and why this caused the node to shut down.

Does anybody know the root cause, and how to prevent this from happening again?

puddor

1 Answer


Network problems like this have two common causes:

  1. A network problem(!)
  2. A long JVM pause
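From the other nodes' point of view, both causes look the same: nothing arrives from the affected node within the failure detection window, so they drop it from the topology. That is consistent with node 39 logging "Node FAILED" for node 36 and then rejecting node 36's connection as an "unknown node". The sketch below is a simplified, hypothetical model of a timeout-based failure detector, not Ignite's actual discovery implementation; the class name and the 10-second timeout are assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified timeout-based failure detector (not Ignite's code).
public class TimeoutFailureDetector {
    private final long timeoutMs;
    // Last time (in ms) we heard anything from each peer.
    private final Map<String, Long> lastHeard = new HashMap<>();

    public TimeoutFailureDetector(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Called whenever any message or heartbeat arrives from a peer.
    public void onMessage(String nodeId, long nowMs) {
        lastHeard.put(nodeId, nowMs);
    }

    // A peer is considered failed if it was never heard from,
    // or if the silence has lasted longer than the timeout.
    public boolean isFailed(String nodeId, long nowMs) {
        Long last = lastHeard.get(nodeId);
        return last == null || nowMs - last > timeoutMs;
    }

    public static void main(String[] args) {
        TimeoutFailureDetector detector = new TimeoutFailureDetector(10_000);
        detector.onMessage("node-live-36", 0);
        System.out.println(detector.isFailed("node-live-36", 5_000));  // false
        // node-live-36 goes silent for 12 s. The detector cannot tell
        // a dropped network link from a stalled (e.g. GC-paused) JVM.
        System.out.println(detector.isFailed("node-live-36", 12_000)); // true
    }
}
```

Because the detector only sees silence, a node that is paused longer than the timeout is removed exactly as if its network cable had been pulled, even though the network itself is fine.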

You don't show in your logs what happened before the errors, but there's a good chance you'll see warnings about a "Long JVM pause," which means that no Ignite code was being executed for a period of time. In this case, it means that messages from other nodes were not being handled. There are a number of causes for long pauses, but the most common is incorrectly configured garbage collectors. See the documentation for some hints.
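Ignite has its own detector that produces these pause warnings, so the following is only a minimal sketch of the idea behind such a check, assuming nothing about Ignite's internals: a background thread sleeps for a short fixed interval, and if it wakes up much later than expected, the whole JVM (for example, during a stop-the-world GC) was probably stalled. The class name `PauseDetector`, the 50 ms interval, and the 500 ms threshold are hypothetical choices, not Ignite defaults.

```java
// Hypothetical sketch of a long-pause detector (not Ignite's implementation).
public class PauseDetector {

    // Returns the excess delay if the observed gap exceeded the expected
    // interval by more than thresholdMs; otherwise returns 0.
    static long excessPauseMs(long expectedMs, long observedMs, long thresholdMs) {
        long excess = observedMs - expectedMs;
        return excess > thresholdMs ? excess : 0;
    }

    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 50;   // assumed wake-up interval
        final long thresholdMs = 500; // assumed reporting threshold

        Thread detector = new Thread(() -> {
            long last = System.nanoTime();
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(intervalMs);
                } catch (InterruptedException e) {
                    return; // stop when asked to
                }
                long now = System.nanoTime();
                long pause = excessPauseMs(intervalMs, (now - last) / 1_000_000, thresholdMs);
                if (pause > 0) {
                    // Every thread in the JVM was likely stalled for about this long,
                    // including the ones that answer other nodes' messages.
                    System.err.println("Possible long JVM pause: ~" + pause + " ms");
                }
                last = now;
            }
        }, "pause-detector");
        detector.setDaemon(true);
        detector.start();

        Thread.sleep(1_000); // let the detector run briefly in this demo
        detector.interrupt();
    }
}
```

If a check like this reports pauses close to or above the cluster's failure detection timeout, GC tuning (heap sizing, collector choice) is the likely fix rather than network changes.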

Stephen Darlington