Ignite networking failed

Question

I configured discovery with static IP addresses:

    TcpDiscoverySpi spi = new TcpDiscoverySpi();
    TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
    ipFinder.setAddresses(Arrays.asList("76.3.16.109", "76.3.16.110", "76.3.16.111", "76.3.16.112", "76.3.16.113"));

Ignite log:

Failed to send message [node=TcpDiscoveryNode [id=2402793f-f484-4f3a-9213-82beeebfd09a, consistentId=76.3.16.110:23054, addrs=ArrayList [76.3.16.110], sockAddrs=HashSet [fl-76-3-16-110.dhcp.embarqhsd.net/76.3.16.110:23054], discPort=23054, order=15, intOrder=9, lastExchangeTime=1631517404103, loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=false], msg=GridQueryCancelRequest [qryReqId=3560], errMsg=Failed to send message (node left topology): TcpDiscoveryNode [id=2402793f-f484-4f3a-9213-82beeebfd09a, consistentId=76.3.16.110:23054, addrs=ArrayList [76.3.16.110], sockAddrs=HashSet [fl-76-3-16-110.dhcp.embarqhsd.net/76.3.16.110:23054], discPort=23054, order=15, intOrder=9, lastExchangeTime=1631517404103, loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=false]]
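
The `sockAddrs` entry in the log above has the shape `hostname/address:port`. That is how `java.net.InetSocketAddress` prints an address that carries a host name; an address built from an IP literal alone prints with an empty host part. A minimal sketch of the two cases, assuming Ignite simply prints `InetSocketAddress` objects in that field (the class and method names below are hypothetical, and the values are copied from the logs in this thread):

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

public class SockAddrFormat {
    /** Address built from an IP literal only: no host name is attached and no lookup is made. */
    public static String withoutHostName(String ipLiteral, int port) {
        try {
            InetAddress addr = InetAddress.getByName(ipLiteral);
            return new InetSocketAddress(addr, port).toString();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Not a valid IP literal: " + ipLiteral, e);
        }
    }

    /** Address with an explicitly attached host name; getByAddress does not query a name service. */
    public static String withHostName(String host, byte[] ip, int port) {
        try {
            InetAddress addr = InetAddress.getByAddress(host, ip);
            return new InetSocketAddress(addr, port).toString();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Invalid address length", e);
        }
    }

    public static void main(String[] args) {
        // prints "/76.3.16.110:23054"
        System.out.println(withoutHostName("76.3.16.110", 23054));
        // prints "EulerOS/115.0.77.40:23054"
        System.out.println(withHostName("EulerOS", new byte[] {115, 0, 77, 40}, 23054));
    }
}
```

A host name in `sockAddrs` therefore only shows that a name was attached to the address on that machine. It does not prove a fault by itself, but it is a hint that name resolution is configured differently from environments where the host part is empty.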

/etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4

::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

10.2.144.62 tools.cmc.rnd.huawei.com

/etc/networks

default 0.0.0.0

loopback 127.0.0.0

link-local 169.254.0.0

/etc/hostname

EulerOS

I don't know which part of the configuration is wrong. The same setup works without problems in other environments. Please take a look, thank you.

You need to check the local logs of the failed node (`Failed to send message (node left topology): TcpDiscoveryNode [id=2402793f-f484-4f3a-9213-82beeebfd09a`) to find the root cause. — Alexandr Shapkin, Oct 28 '21 at 10:47
The log shows sockAddrs=HashSet [fl-76-3-16-110.dhcp.embarqhsd.net/76.3.16.110:23054], but in the other environments the addresses have no host names. Is this related? — biandeqiang, Oct 28 '21 at 11:15
I don't know, there might be a variety of reasons behind a node failure and I suggest checking the logs in detail. What's your current problem? The nodes won't join each other or one of them is being disconnected from the cluster after some time? What about your environment? Is it cloud-based (k8s), how many nodes do you have? I mean, it's not clear what's your concern. — Alexandr Shapkin, Oct 28 '21 at 11:23
My problem is that the nodes won't join each other. There are two nodes in total. I have posted one node's log. — biandeqiang, Oct 28 '21 at 12:04
Thank you very much. The problem has been found. The execution of the U.jvmPid() method times out. As a result, the node initialization is slow and the networking fails. The possible cause is that the /etc/hosts and /etc/sysconfig/network configurations are incorrect. — biandeqiang, Oct 29 '21 at 01:28
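
To check whether local host-name resolution is what slows the startup, the relevant JDK calls can be timed outside Ignite. The sketch below is only a diagnostic under stated assumptions: it assumes that `U.jvmPid()` derives the PID from `ManagementFactory.getRuntimeMXBean().getName()`, and that this name has the usual `pid@hostname` form, which the JDK does not guarantee. The class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.net.InetAddress;
import java.util.concurrent.Callable;

public class HostnameResolutionCheck {
    /** Runs the task once, prints its result or failure, and returns elapsed milliseconds. */
    public static long timeMillis(Callable<?> task) {
        long start = System.nanoTime();
        try {
            System.out.println("  result: " + task.call());
        } catch (Exception e) {
            System.out.println("  failed: " + e);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Extracts the PID from a "pid@hostname" runtime name; returns -1 if the format differs. */
    public static int pidFromRuntimeName(String name) {
        int at = name.indexOf('@');
        if (at <= 0)
            return -1;
        try {
            return Integer.parseInt(name.substring(0, at));
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        // Measured first: the JDK may cache the local host for a short time afterwards.
        System.out.println("RuntimeMXBean.getName():");
        long nameMs = timeMillis(() -> ManagementFactory.getRuntimeMXBean().getName());
        System.out.println("  took " + nameMs + " ms");

        System.out.println("InetAddress.getLocalHost():");
        long hostMs = timeMillis(InetAddress::getLocalHost);
        System.out.println("  took " + hostMs + " ms");

        String name = ManagementFactory.getRuntimeMXBean().getName();
        System.out.println("PID parsed from runtime name: " + pidFromRuntimeName(name));
    }
}
```

If the first call takes seconds rather than milliseconds on the affected machine, the local host name (`EulerOS` in `/etc/hostname` above) is probably not resolving quickly. Mapping that name to the machine's own address in `/etc/hosts` is the usual thing to try.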

score 0 · Answer 1 · answered Oct 28 '21 at 12:08

[2021-10-28 19:42:20,560][WARN ][0][0][tcp-disco-sock-reader-[d5c103a9 115.0.77.41:39585]-#4-#800][][IgniteLoggerImp][74] Failed to shutdown socket: closing inbound before receiving peer's close_notify
javax.net.ssl.SSLException: closing inbound before receiving peer's close_notify
    at sun.security.ssl.SSLSocketImpl.shutdownInput(SSLSocketImpl.java:735)
    at sun.security.ssl.SSLSocketImpl.shutdownInput(SSLSocketImpl.java:714)
    at org.apache.ignite.internal.util.IgniteUtils.close(IgniteUtils.java:4232)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7382)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)

[2021-10-28 19:42:25,205][WARN ][0][0][jvm-pause-detector-worker][][IgniteLoggerImp][72] Possible too long JVM pause: 4503 milliseconds.

[2021-10-28 19:42:47,285][WARN ][0][0][fm.monitor.rebuild-105-1][ROOT][IgniteLoggerImp][72] Query produced big result set. [fetched=100000, duration=1250ms, type=MAP, distributedJoin=false, enforceJoinOrder=false, lazy=false, schema=alarmCache, sql='SELECT\n__Z0.MERGED __C0_0,\n__Z0.SPECIALALARMSTATUS __C0_1,\n__Z0.NATIVEMODN __C0_2,\n__Z0.SEVERITY __C0_3,\n__Z0.ACKED __C0_4,\n__Z0.CLEARED __C0_5,\n__Z0.MEDN __C0_6,\n__Z0.CSN __C0_7\nFROM "alarmCache".ALARMRECORD __Z0', plan=SELECT\n __Z0.MERGED AS __C0_0,\n __Z0.SPECIALALARMSTATUS AS __C0_1,\n __Z0.NATIVEMODN AS __C0_2,\n __Z0.SEVERITY AS __C0_3,\n __Z0.ACKED AS __C0_4,\n __Z0.CLEARED AS __C0_5,\n __Z0.MEDN AS __C0_6,\n __Z0.CSN AS __C0_7\nFROM "alarmCache".ALARMRECORD __Z0\n /* "alarmCache".ALARMRECORD._SCAN */\n /* scanCount: 140023 */, node=TcpDiscoveryNode [id=4b9a0352-7d68-4a51-8f31-b4ae2405ee7c, consistentId=115.0.77.40:23054, addrs=ArrayList [115.0.77.40], sockAddrs=HashSet [EulerOS/115.0.77.40:23054], discPort=23054, order=1, intOrder=1, lastExchangeTime=1635415317696, loc=true, ver=2.11.0#20210911-sha1:8f3f07d3, isClient=false], reqId=35, segment=1]

[2021-10-28 19:42:47,287][WARN ][0][0][fm.monitor.rebuild-105-1][ROOT][IgniteLoggerImp][72] Query produced big result set. [fetched=100000, duration=1260ms, type=MAP, distributedJoin=false, enforceJoinOrder=false, lazy=false, schema=alarmCache, sql='SELECT\n__Z0.MERGED __C0_0,\n__Z0.SPECIALALARMSTATUS __C0_1,\n__Z0.NATIVEMODN __C0_2,\n__Z0.SEVERITY __C0_3,\n__Z0.ACKED __C0_4,\n__Z0.CLEARED __C0_5,\n__Z0.MEDN __C0_6,\n__Z0.CSN __C0_7\nFROM "alarmCache".ALARMRECORD __Z0', plan=SELECT\n __Z0.MERGED AS __C0_0,\n __Z0.SPECIALALARMSTATUS AS __C0_1,\n __Z0.NATIVEMODN AS __C0_2,\n __Z0.SEVERITY AS __C0_3,\n __Z0.ACKED AS __C0_4,\n __Z0.CLEARED AS __C0_5,\n __Z0.MEDN AS __C0_6,\n __Z0.CSN AS __C0_7\nFROM "alarmCache".ALARMRECORD __Z0\n /* "alarmCache".ALARMRECORD._SCAN */\n /* scanCount: 140688 */, node=TcpDiscoveryNode [id=4b9a0352-7d68-4a51-8f31-b4ae2405ee7c, consistentId=115.0.77.40:23054, addrs=ArrayList [115.0.77.40], sockAddrs=HashSet [EulerOS/115.0.77.40:23054], discPort=23054, order=1, intOrder=1, lastExchangeTime=1635415317696, loc=true, ver=2.11.0#20210911-sha1:8f3f07d3, isClient=false], reqId=35, segment=0]

[2021-10-28 19:42:47,453][WARN ][0][0][fm.monitor.rebuild-105-1][ROOT][IgniteLoggerImp][72] Query produced big result set. [fetched=140022, duration=1422ms, type=MAP, distributedJoin=false, enforceJoinOrder=false, lazy=false, schema=alarmCache, sql='SELECT\n__Z0.MERGED __C0_0,\n__Z0.SPECIALALARMSTATUS __C0_1,\n__Z0.NATIVEMODN __C0_2,\n__Z0.SEVERITY __C0_3,\n__Z0.ACKED __C0_4,\n__Z0.CLEARED __C0_5,\n__Z0.MEDN __C0_6,\n__Z0.CSN __C0_7\nFROM "alarmCache".ALARMRECORD __Z0', plan=SELECT\n __Z0.MERGED AS __C0_0,\n __Z0.SPECIALALARMSTATUS AS __C0_1,\n __Z0.NATIVEMODN AS __C0_2,\n __Z0.SEVERITY AS __C0_3,\n __Z0.ACKED AS __C0_4,\n __Z0.CLEARED AS __C0_5,\n __Z0.MEDN AS __C0_6,\n __Z0.CSN AS __C0_7\nFROM "alarmCache".ALARMRECORD __Z0\n /* "alarmCache".ALARMRECORD._SCAN */\n /* scanCount: 140023 */, node=TcpDiscoveryNode [id=4b9a0352-7d68-4a51-8f31-b4ae2405ee7c, consistentId=115.0.77.40:23054, addrs=ArrayList [115.0.77.40], sockAddrs=HashSet [EulerOS/115.0.77.40:23054], discPort=23054, order=1, intOrder=1, lastExchangeTime=1635415317696, loc=true, ver=2.11.0#20210911-sha1:8f3f07d3, isClient=false], reqId=35, segment=1]

[2021-10-28 19:42:47,461][WARN ][0][0][fm.monitor.rebuild-105-1][ROOT][IgniteLoggerImp][72] Query produced big result set. [fetched=140687, duration=1432ms, type=MAP, distributedJoin=false, enforceJoinOrder=false, lazy=false, schema=alarmCache, sql='SELECT\n__Z0.MERGED __C0_0,\n__Z0.SPECIALALARMSTATUS __C0_1,\n__Z0.NATIVEMODN __C0_2,\n__Z0.SEVERITY __C0_3,\n__Z0.ACKED __C0_4,\n__Z0.CLEARED __C0_5,\n__Z0.MEDN __C0_6,\n__Z0.CSN __C0_7\nFROM "alarmCache".ALARMRECORD __Z0', plan=SELECT\n __Z0.MERGED AS __C0_0,\n __Z0.SPECIALALARMSTATUS AS __C0_1,\n __Z0.NATIVEMODN AS __C0_2,\n __Z0.SEVERITY AS __C0_3,\n __Z0.ACKED AS __C0_4,\n __Z0.CLEARED AS __C0_5,\n __Z0.MEDN AS __C0_6,\n __Z0.CSN AS __C0_7\nFROM "alarmCache".ALARMRECORD __Z0\n /* "alarmCache".ALARMRECORD._SCAN */\n /* scanCount: 140688 */, node=TcpDiscoveryNode [id=4b9a0352-7d68-4a51-8f31-b4ae2405ee7c, consistentId=115.0.77.40:23054, addrs=ArrayList [115.0.77.40], sockAddrs=HashSet [EulerOS/115.0.77.40:23054], discPort=23054, order=1, intOrder=1, lastExchangeTime=1635415317696, loc=true, ver=2.11.0#20210911-sha1:8f3f07d3, isClient=false], reqId=35, segment=0]

[2021-10-28 19:43:04,759][WARN ][0][0][jvm-pause-detector-worker][ROOT][IgniteLoggerImp][72] Possible too long JVM pause: 12688 milliseconds.

[2021-10-28 19:43:04,775][WARN ][0][0][tcp-disco-msg-worker-[crd]-#2-#55][][IgniteLoggerImp][72] Failed to send message to next node [msg=TcpDiscoveryNodeAddedMessage [node=TcpDiscoveryNode [id=d5c103a9-4b8a-4430-bab1-fa63ad8066e7, consistentId=115.0.77.41:23054, addrs=ArrayList [115.0.77.41], sockAddrs=HashSet [EulerOS/115.0.77.40:23054, 115.0.77.41/115.0.77.41:23054], discPort=23054, order=0, intOrder=2, lastExchangeTime=1635421284474, loc=false, ver=2.11.0#20210911-sha1:8f3f07d3, isClient=false], dataPacket=o.a.i.spi.discovery.tcp.internal.DiscoveryDataPacket@492d48ec, discardMsgId=null, discardCustomMsgId=null, top=null, clientTop=null, gridStartTime=1635415317736, super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=9d88656cc71-4b9a0352-7d68-4a51-8f31-b4ae2405ee7c, verifierNodeId=4b9a0352-7d68-4a51-8f31-b4ae2405ee7c, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=TcpDiscoveryNode [id=d5c103a9-4b8a-4430-bab1-fa63ad8066e7, consistentId=115.0.77.41:23054, addrs=ArrayList [115.0.77.41], sockAddrs=HashSet [EulerOS/115.0.77.40:23054, 115.0.77.41/115.0.77.41:23054], discPort=23054, order=0, intOrder=2, lastExchangeTime=1635421284474, loc=false, ver=2.11.0#20210911-sha1:8f3f07d3, isClient=false], errMsg=Failed to send message to next node [msg=TcpDiscoveryNodeAddedMessage [node=TcpDiscoveryNode [id=d5c103a9-4b8a-4430-bab1-fa63ad8066e7, consistentId=115.0.77.41:23054, addrs=ArrayList [115.0.77.41], sockAddrs=HashSet [EulerOS/115.0.77.40:23054, 115.0.77.41/115.0.77.41:23054], discPort=23054, order=0, intOrder=2, lastExchangeTime=1635421284474, loc=false, ver=2.11.0#20210911-sha1:8f3f07d3, isClient=false], dataPacket=o.a.i.spi.discovery.tcp.internal.DiscoveryDataPacket@492d48ec, discardMsgId=null, discardCustomMsgId=null, top=null, clientTop=null, gridStartTime=1635415317736, super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=9d88656cc71-4b9a0352-7d68-4a51-8f31-b4ae2405ee7c, verifierNodeId=4b9a0352-7d68-4a51-8f31-b4ae2405ee7c, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=ClusterNode [id=d5c103a9-4b8a-4430-bab1-fa63ad8066e7, order=0, addr=[115.0.77.41], daemon=false]]]

[2021-10-28 19:43:04,777][WARN ][0][0][tcp-disco-msg-worker-[crd]-#2-#55][ROOT][IgniteLoggerImp][72] Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi'

[2021-10-28 19:43:04,814][WARN ][0][0][disco-event-worker-#57][][IgniteLoggerImp][72] Node FAILED: TcpDiscoveryNode [id=d5c103a9-4b8a-4430-bab1-fa63ad8066e7, consistentId=115.0.77.41:23054, addrs=ArrayList [115.0.77.41], sockAddrs=HashSet [EulerOS/115.0.77.40:23054, 115.0.77.41/115.0.77.41:23054], discPort=23054, order=2, intOrder=2, lastExchangeTime=1635421284474, loc=false, ver=2.11.0#20210911-sha1:8f3f07d3, isClient=false]

As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). — Community, Oct 28 '21 at 12:35

score 0 · Answer 2 · edited Oct 28 '21 at 13:05

[2021-10-28 19:42:20,907][DEBUG][0][0][tcp-disco-msg-worker-[]-#2-#35][ROOT][IgniteLoggerImp][51] Message has been added to a worker's queue: TcpDiscoveryStatusCheckMessage [creatorNode=null, failedNodeId=null, status=0, super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=04671b6cc71-d5c103a9-4b8a-4430-bab1-fa63ad8066e7, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]]

[2021-10-28 19:42:20,907][DEBUG][0][0][tcp-disco-msg-worker-[]-#2-#35][ROOT][IgniteLoggerImp][51] Processing message [cls=TcpDiscoveryStatusCheckMessage, id=04671b6cc71-d5c103a9-4b8a-4430-bab1-fa63ad8066e7]

[2021-10-28 19:42:20,907][DEBUG][0][0][tcp-disco-msg-worker-[]-#2-#35][ROOT][IgniteLoggerImp][51] Ignore message, local node order is not initialized [msg=TcpDiscoveryStatusCheckMessage [creatorNode=null, failedNodeId=null, status=0, super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=04671b6cc71-d5c103a9-4b8a-4430-bab1-fa63ad8066e7, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], locNode=Tcp

As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). — Community, Oct 28 '21 at 13:05
