
I have a cluster of 4 nodes (3 servers and 1 management node) that was working properly.

At the beginning of the month we planned to patch the OS, and we started with the first server node using this procedure:

  • Stop service
  • OS patching
  • Server restart
  • Start service

After patching, the service on the first node, named "serverA", fails to restart with this error:

Log entries during cluster join:

serverA:

| INFO | region-dm-12 | ache.geode.internal.tcp.Connection | --> Connection: shared=true ordered=false failed to connect to peer 10.237.110.195( Server serverB:9993):1024 because: java.net.ConnectException: Connection timed out (Connection timed out)

| WARN | region-dm-12 | ache.geode.internal.tcp.Connection | --> Connection: Attempting reconnect to peer 10.237.110.195( Server serverB:9993):1024

ServerMgmt:

| WARN | pool-3-thread-1 | tributed.internal.ReplyProcessor21 | --> 15 seconds have elapsed while waiting for replies: <CreateRegionProcessor$CreateRegionReplyProcessor 44180 waiting for 1 replies from [10.237.110.194( Server serverA:632):1024]> on 10.237.110.225( Management:6033):1024 whose current membership list is: [[10.237.110.196( Server serverC:16805):1024, 10.237.110.225( Management:6033):1024, 10.237.110.195( Server serverB:9993):1024, 10.237.110.194( Server serverA:632):1024]]

We verified the connection between the systems with tcpdump; UDP traffic on port 1024 is flowing fine.

We have tried redeploying the service and restarting it numerous times, but we always get the same error during startup.

Any suggestions? Thank you.

Marco.

Olaf Kock

2 Answers

For this error message to appear, serverA was probably able to send UDP messages to serverB but is failing to create a TCP connection. It's hard to say why, though: a firewall issue, some TCP configuration issue, ...?

Check whether serverB has anything interesting in its logs. Since you are using tcpdump, you should be watching for that TCP connection to serverB:9993, since it looks like that is what failed.
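To reproduce the failing TCP connection outside of the service, a small probe run from serverA may help. This is only a minimal sketch: the function name `can_connect` is made up for illustration, and the host and port you pass in are assumptions you need to take from your own configuration and logs.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only attempts a plain TCP
    handshake and does not speak any cluster protocol.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return False
```

For example, calling `can_connect("10.237.110.195", <port>)` on serverA, with the port serverB is expected to listen on, and watching the same attempt in tcpdump on both sides should show whether the SYN leaves serverA and whether serverB answers it.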

Dan Smith

There is no firewall between the systems. We analyzed the network connection again during startup of node A, and we can see that communication can be established between all systems. What we did detect is that on port 2323, which is configured as the locator port, node A sends packets to nodes B and C but only receives packets back from node C, not from node B. For us this is another sign that node B has an issue. Is there a way to check our assumption from node B?

Tcpdump overview

A node ip .194

B node ip .195

C node ip .196

Management ip .225
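One possible way to check the assumption from node B is to run the same kind of TCP probe there against the other nodes and compare the outcome with a run from node A. The sketch below is an illustration only: `probe_peers` is a hypothetical helper, and it assumes port 2323 accepts plain TCP connections. A "timeout" result usually means packets are being dropped somewhere, while "refused" means the host answered but nothing is listening on that port.

```python
import socket
from typing import Dict


def probe_peers(peers: Dict[str, str], port: int, timeout: float = 3.0) -> Dict[str, str]:
    """Try a TCP connection to each peer and report the outcome per peer.

    Hypothetical helper for illustration; `peers` maps a label to a host
    or IP address chosen by the caller.
    """
    results = {}
    for name, host in peers.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[name] = "open"
        except socket.timeout:
            # No answer at all: packets are likely filtered or dropped.
            results[name] = "timeout"
        except ConnectionRefusedError:
            # Host reachable, but no listener on this port.
            results[name] = "refused"
        except OSError as exc:
            # Anything else, e.g. no route to host.
            results[name] = "error: {}".format(exc)
    return results
```

On node B this could be called as `probe_peers({"A": "10.237.110.194", "C": "10.237.110.196", "Management": "10.237.110.225"}, 2323)`, using the addresses from the overview above.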
Siva Shanmugam
Mike Marti