
I have a 5-node Service Grid running on an Ignite 2.10.0 cluster. While testing an upgrade, I stop one Server Node (SIGTERM) and wait for it to rejoin, but it fails to stay connected to the cluster.

Each node is a primary microservice provider and a backup for another (Cluster Singletons). The service that was running on the node that left the cluster is properly picked up by its backup node. However, the Server Node cannot stay connected to the cluster ever again!
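For reference, each service is deployed as a cluster singleton, roughly like this (the service name and class below are placeholders, not my real ones):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.services.Service;
    import org.apache.ignite.services.ServiceContext;

    public class OrderService implements Service {
        @Override public void init(ServiceContext ctx) { /* acquire resources */ }
        @Override public void execute(ServiceContext ctx) { /* main service loop */ }
        @Override public void cancel(ServiceContext ctx) { /* release resources */ }
    }

    // Exactly one instance cluster-wide; when the hosting node leaves,
    // Ignite redeploys the singleton on another node (the "backup").
    Ignite ignite = Ignition.ignite();
    ignite.services().deployClusterSingleton("orderService", new OrderService());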

Rejoin strategy:

  1. Let systemd restart Ignite.
  2. The node rejoins, but then the new Server Node invokes its shutdown hook.
  3. Go back to step 1.

I have no idea why the rejoined node shuts itself down. As far as I can tell, the Coordinator did not kill this youngest Server Node. I am logging at DEBUG with IGNITE_QUIET set to false, and I still can't find anything in the logs.
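To narrow things down, a LifecycleBean along these lines should at least show when, and from where, the stop is triggered (just a sketch, not what I'm running today):

    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.lifecycle.LifecycleEventType;

    IgniteConfiguration cfg = new IgniteConfiguration();

    // Print every lifecycle transition; on BEFORE_NODE_STOP also dump a
    // stack trace so the log shows what invoked the shutdown.
    cfg.setLifecycleBeans(evt -> {
        System.out.println("Lifecycle event: " + evt);
        if (evt == LifecycleEventType.BEFORE_NODE_STOP)
            new Exception("node stop requested here").printStackTrace();
    });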

I tried increasing network timeouts, but the newly rejoined node still shuts down.
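By "increasing network timeouts" I mean settings along these lines (the values here are just examples):

    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

    IgniteConfiguration cfg = new IgniteConfiguration();

    // Give nodes more time before the cluster declares them failed.
    cfg.setFailureDetectionTimeout(30_000);

    TcpDiscoverySpi disco = new TcpDiscoverySpi();
    disco.setNetworkTimeout(15_000); // discovery network operations
    disco.setJoinTimeout(60_000);    // how long a joining node keeps retrying
    cfg.setDiscoverySpi(disco);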

Any idea what is going on or where to look?

Thanks in advance.

Greg

Environment:

RHEL 7.9, Java 11

Ignite configuration:

  • persistence is set to false
  • clientReconnectDisabled is set to true
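In Java terms, those parts of the configuration are roughly equivalent to this (a sketch, not my actual config file):

    import org.apache.ignite.configuration.DataRegionConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

    IgniteConfiguration cfg = new IgniteConfiguration();

    // Pure in-memory cluster: no native persistence on the default region.
    DataStorageConfiguration storage = new DataStorageConfiguration();
    storage.setDefaultDataRegionConfiguration(
        new DataRegionConfiguration().setPersistenceEnabled(false));
    cfg.setDataStorageConfiguration(storage);

    // Disconnected clients must restart as new nodes instead of reconnecting.
    TcpDiscoverySpi disco = new TcpDiscoverySpi();
    disco.setClientReconnectDisabled(true);
    cfg.setDiscoverySpi(disco);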
Comments:

  • OK, the issue has to do with systemd, not Ignite. systemd on the RHEL 7 box is finding the PID file??? Any clue? – Greg Sylvain Dec 09 '21 at 21:45
  • Try making a brand-new cluster completely separate from the problematic one, using ignite.sh instead of systemd. You might also be running into a memory issue: check whether the Linux out-of-memory (OOM) killer is terminating the process. See: https://serverfault.com/questions/141988/avoid-linux-out-of-memory-application-teardown – Alex K Dec 14 '21 at 18:47

0 Answers