I have 2 PCs, and I run the following commands in the gfsh terminal on each of them:

start locator --name=locator1 --locators=ipaddress1[10334], ipaddress2[10334]
start server --name=server1 --locators=ipaddress1[10334], ipaddress2[10334]

After they start, I can see all 4 members on both terminals when I run list members.

NOW:
Say I run these commands on PC1 first, then on PC2 (so PC1 is the first one online). If I shut down PC2 to simulate a PC failure, PC1 is OK. When I list members, it shows 2 (its locator and server).

I bring PC2 back up and run the commands again, and everything is good with 4 members again.

HOWEVER, if I shut down PC1 (the first PC in the original cluster startup), PC2 drops its connection with everything shortly after (about 5 seconds). The gfsh connection is dropped and I am unable to connect to localhost at all, but the server and locator processes are still running.

The logs say: Membership Service Failure: Exiting due to possible network partition event due to loss of 2 cache processes.

When I bring PC1 back online and run the locator and server commands, I can connect again on PC2.

Can anyone help me with this? I am having a really hard time figuring out what is happening here.

1 Answer

Geode members automatically shut themselves down whenever the cluster loses quorum, that is, when the total membership weight drops below 51% within a single membership view change. This prevents split-brain situations and data corruption. You can find more details about this in Network Partitioning.
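
To see why losing PC1 hurts but losing PC2 does not, here is a rough sketch of the weight arithmetic. It assumes the default weights described in the Network Partitioning docs (3 per locator, 10 per cache server, plus 5 for the lead member) and assumes the lead member is the server on PC1, since it is the oldest server in the cluster. The names in the snippet are mine, not Geode APIs.

```python
# Assumed default member weights from the Geode Network Partitioning docs.
LOCATOR_WEIGHT = 3
SERVER_WEIGHT = 10
LEAD_BONUS = 5            # extra weight for the lead member (assumed: oldest server)
QUORUM_THRESHOLD = 0.51   # remaining weight below this fraction means quorum is lost


def member_weight(kind, is_lead=False):
    """Weight of one member; kind is 'locator' or 'server'."""
    base = LOCATOR_WEIGHT if kind == "locator" else SERVER_WEIGHT
    return base + (LEAD_BONUS if is_lead else 0)


def survives_loss(members, lost_host):
    """Return (remaining weight, total weight, quorum kept?) after a host disappears."""
    total = sum(member_weight(kind, lead) for _, kind, lead in members)
    remaining = sum(
        member_weight(kind, lead)
        for host, kind, lead in members
        if host != lost_host
    )
    return remaining, total, remaining / total >= QUORUM_THRESHOLD


# (host, kind, is_lead): PC1 started first, so its server is assumed to be the lead.
members = [
    ("PC1", "locator", False),
    ("PC1", "server", True),
    ("PC2", "locator", False),
    ("PC2", "server", False),
]

for host in ("PC1", "PC2"):
    remaining, total, ok = survives_loss(members, host)
    outcome = "keeps running" if ok else "shuts down"
    print(f"lose {host}: {remaining}/{total} remaining -> survivor {outcome}")
```

Under those assumptions PC1 carries 18 of the 31 total weight and PC2 carries 13. Losing PC2 leaves 18/31 (about 58%), so PC1 keeps running; losing PC1 leaves 13/31 (about 42%), so the members on PC2 declare a network partition and shut down their membership service.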

Cheers.

Juan Ramos
  • The docs say that with the default value of enable-network-partition-detection, any member that detects that the total membership weight has dropped below 51% within a single membership view change (loss of quorum) declares a network partition event. I have this set to false, though, in the gemfire.properties file on both PC1 and PC2. Wouldn't this prevent the membership quorum issue? – Michael Thoresen Aug 25 '21 at 12:31
  • If you set the `enable-network-partition-detection` property to `false` (not recommended), the quorum algorithm shouldn't kick in at all. Have you checked the logs to make sure the property value change took effect? – Juan Ramos Aug 25 '21 at 13:48
  • Thank you for your post. I was finally able to get it to work. I understand that it is not recommended, and I would definitely not use this in a production environment. I am basically just trying to put together a "demo" for my group on these types of frameworks. Your first response helped a lot and is very much appreciated. Thanks – Michael Thoresen Sep 02 '21 at 15:03