
I'm setting up a test system to teach myself about load balancing and high availability, and I'm curious about a configuration setting in Corosync. I'd like to hear what those of you who have experience with it have to say.

The thing I'm researching and learning now is Corosync votequorum and how to deal with failed nodes. During a small research session I found discussions about STONITH and split-brain scenarios, where each node assumes it is the sole survivor, thinks it is the master, attempts to stay master, and so on. This is of course an unwanted scenario.

In the Corosync configuration I came across a specific setting:

quorum {
        ...
        auto_tie_breaker: 1
        auto_tie_breaker_node: lowest
}

Could the auto_tie_breaker prevent such a split-brain scenario, or am I mistaken?

If I understood the documentation right, setting it to lowest means that the node with the lowest nodeid would be the one in charge?

nodelist {
          node {
          ring0_addr: primary_private_ip
          name: primary
          nodeid: 1
          }

          node {
          ring0_addr: secondary_private_ip
          name: secondary
          nodeid: 2
          }
}

Of course, I'm only testing on a two-node cluster at the moment, but I'm aiming to get an understanding of how the process works so I can successfully set up a more reliable infrastructure in the future.

Thanks for any input and guidance, and have a great day! :)

StianM

1 Answer


You are correct in the assumption that auto_tie_breaker will try to resolve a node failure in even-node configurations (1/1, 1/1/1/1, etc.) by "forcing" the cluster to remain connected to the correct set of nodes (or a single node in two-node clusters).

The general behaviour of votequorum allows a simultaneous node failure of up to 50% - 1 nodes, assuming each node has 1 vote (e.g. in a 4-node cluster, quorum is 3, so at most 1 node may fail at the same time).

When ATB is enabled, the cluster can suffer up to 50% of the nodes failing at the same time, in a deterministic fashion. By default, the cluster partition, or the set of nodes that are still in contact with the node that has the lowest nodeid, will remain quorate. The other nodes will be inquorate. This behaviour can be changed by also specifying

auto_tie_breaker_node: lowest|highest|<list of node IDs>

'lowest' is the default; 'highest' is similar in that if the current set of nodes contains the highest nodeid then it will remain quorate. Alternatively, it is possible to specify a particular node ID or list of node IDs that will be required to maintain quorum. If a (space-separated) list is given, the nodes are evaluated in order: if the first node is present, it will be used to determine the quorate partition; if that node is not in either half (i.e. it was not in the cluster before the split), then the second node ID will be checked, and so on. Note that ATB is incompatible with quorum devices: if auto_tie_breaker is specified in corosync.conf, then the quorum device will be disabled.
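
To tie this back to your two-node setup, a minimal sketch of a quorum section with ATB enabled could look like the following (expected_votes here is an assumption derived from your two-node nodelist; corosync can also calculate it from the nodelist automatically):

quorum {
        provider: corosync_votequorum
        expected_votes: 2
        auto_tie_breaker: 1
        # keep the partition that contains the lowest nodeid quorate;
        # a space-separated list of node IDs would also be accepted,
        # e.g. auto_tie_breaker_node: 1 3
        auto_tie_breaker_node: lowest
}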

Remember: it is not a STONITH device, and you can't use it together with the two_node directive.

two_node: 1

Enables two node cluster operations (default: 0).

The "two node cluster" is a use case that requires special consideration. With a standard two node cluster, each node with a single vote, there are 2 votes in the cluster. Using the simple majority calculation (50% of the votes + 1) to calculate quorum, the quorum would be 2. This means that the both nodes would always have to be alive for the cluster to be quorate and operate.

Enabling two_node: 1, quorum is set artificially to 1
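
For comparison, a minimal sketch of the classic two-node configuration would look roughly like this (note that, according to the votequorum man page, two_node also implicitly enables wait_for_all, so the cluster only becomes quorate for the first time once both nodes have been seen at once):

quorum {
        provider: corosync_votequorum
        expected_votes: 2
        two_node: 1
}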

Quorum votes for clusters usually had to be used either in n+1 node scenarios, or with the two_node parameter, where expected_votes had to be set to 2 and hardware fencing / STONITH had to be enabled. So the new go-to method for even-node clusters without hardware fencing or STONITH is auto_tie_breaker.

In n+1 clusters, quorum votes are still quite reliable, but for high-profile Linux HA, hardware fencing / STONITH should remain king.

As always, be sure to test all possible scenarios, like network outages, hardware failures, power loss, simultaneous resource errors, DRBD errors (if used), etc., and read this document on the "new" features of Corosync.

Lenniey
  • Thank you very much for the clear and well described answer! This cleared up a lot of my confusion. You mention this: "Remember: it is not a STONITH device, and you can't use it together with the two_node directive." Does this mean that I'm unable to use "auto_tie_breaker: lowest" in my two-node setup? Again, thank you very much for a good answer! :) – StianM Apr 19 '18 at 10:55
  • No, it means you can't use the `two_node` directive in your corosync config, but you don't even need to. This is because the cluster won't choose the primary node based on votequorum, but on the node ID. The default votequorum doesn't work with 2-node setups, which is why `two_node` was introduced in the first place. – Lenniey Apr 19 '18 at 11:23
  • Ahhh! Thank you for clearing that up! Really helpful, and it sounds way more effective than the two_node directive! – StianM Apr 19 '18 at 19:21
  • I've decided to use a combination of last man standing and auto tie breaker, to be ready for a higher amount of traffic: last man standing as the main mechanism, and auto tie breaker when corosync detects it only has two nodes remaining. Is there anything I need to be aware of when implementing this setup? I know the individual configuration lines for each mode, but the documentation only points out that they complement each other, with no description beyond that and no configuration for a combined setup. – StianM Apr 23 '18 at 10:59
  • It's all in the docs: NOTES: In order for the cluster to downgrade automatically from 2 nodes to a 1 node cluster, the auto_tie_breaker feature must also be enabled (see below). If auto_tie_breaker is not enabled, and one more failure occurs, the remaining node will not be quorate. LMS does not work with asymmetric voting schemes, each node must vote 1. LMS is also incompatible with quorum devices, if last_man_standing is specified in corosync.conf then the quorum device will be disabled. What exactly do you need? How many nodes will you be configuring? – Lenniey Apr 23 '18 at 11:31
  • At this moment, only two nodes, but it is part of a project that will become bigger, and part of the exam is being prepared for extra traffic: adding extra load balancer nodes and worker nodes when necessary, with as much ease and as little reconfiguration as possible. It is also possible that when I'm done testing and running the project, I will turn it into a very small web hosting setup to handle my sites, friends' sites, and some communities. – StianM Apr 23 '18 at 12:05
  • So the configuration for a combined setup of last man standing and auto tie breaker would be as follows: `quorum { provider: corosync_votequorum expected_votes: 8 last_man_standing: 1 last_man_standing_window: 20000 auto_tie_breaker: 1 auto_tie_breaker_node: lowest }` Am I correct? – StianM Apr 23 '18 at 13:34