I am trying to understand the Heartbeat setup in a new environment. It is a 2-node cluster that still uses Heartbeat version 1 (the one without the Pacemaker CRM), and I have a fundamental question for which I could not find an easy-to-understand answer on Google.

The question is: if communication between the cluster nodes fails while both nodes are still functioning well, how does the cluster manager decide which node to shoot down? I see a ping_group directive in /etc/ha.d/ha.cf. From what I have read, the cluster manager checks connectivity to the hosts listed in ping_group, sees from which cluster node that connectivity is alive, and decides from that which node to shoot down(?). What if both nodes can still reach the ping nodes and only the heartbeat network between the two cluster nodes is down? What am I missing here?

Situation: Only the heartbeat network is down, but both the nodes are UP and fine.

root@automan00:/root : cat /etc/ha.d/ha.cf
debugfile       /var/log/ha-debug
logfile         /var/log/ha-log
logfacility     local0
keepalive       500ms
deadtime        30
warntime        10
initdead        120
udpport         694
baud            19200
bcast           bond1 eth2
auto_failback   off
node            automan00
node            automan01
ping_group group1 1.1.1.1 2.2.2.2
respawn hacluster /usr/lib64/heartbeat/ipfail
realtime on

# stonith directive
stonith external/riloe /etc/ha.d/riloe.cfg
Sreeraj
  • You have "shoot the **other node** in the head" (STONITH) configured. You have to weight certain heuristics with metrics, such as pinging the gateway, connecting to the DB, resolving a host, service status, etc. If host A is active but host B has higher score metrics, then host B can shoot host A in the head (as in boom! powercycle, I'm the active node). So if hosts are not being fenced, it's because there is no metric associated with the heartbeat network. It looks like you are only pinging hosts as a health check, so anything else is "fine". – Sum1sAdmin May 30 '16 at 11:51
  • @Sum1sAdmin Thank you. Where do we tell the setup how to calculate the score? Is that generally done in the `ha.cf` file itself? Is there a sample file anywhere that I can look at to see how to configure 'scoring'(?) – Sreeraj May 30 '16 at 11:56
  • What is your cluster software? – Sum1sAdmin May 30 '16 at 12:17
  • It's a 2-node DRBD cluster with Heartbeat version 1 (no Pacemaker). – Sreeraj May 30 '16 at 12:20
  • So you need to build a new cluster then :-) ..... Heartbeat is old and deprecated; use corosync + pacemaker maybe – Sum1sAdmin May 30 '16 at 12:25
  • I have suggested that, but I do not have the privilege to make the decision. In the meantime I am preparing a document on how the current cluster is set up, and I am trying to figure out how the 'weaker' node is identified by the cluster at present. – Sreeraj May 30 '16 at 12:29

1 Answer

Maybe you can set up a crossover cable between the nodes, with private IPs, as an additional private network for Heartbeat.
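As a rough sketch of what that could look like in `ha.cf`, assuming the crossover cable is attached to an interface named `eth3` and the peer has the private address `192.168.100.2` (both are placeholders, not values from the question):

```
# Existing heartbeat paths from the question
bcast           bond1 eth2
# Hypothetical extra path over the dedicated crossover cable.
# eth3 and 192.168.100.2 are assumed values: use the real interface
# and, on each node, the private address of the *other* node.
ucast           eth3 192.168.100.2
```

This only makes a total loss of heartbeat communication less likely; it does not settle which node wins if every path fails.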

However, when communication fails between only 2 nodes, you don't know which node to shoot down. This is why you need a third node before going to production.

Without a third node that can arbitrate who is working properly and who is not, you will end up in a split-brain situation.

https://en.wikipedia.org/wiki/Split-brain_(computing)
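To make the third-node argument concrete, here is a minimal Python sketch of a generic majority-quorum rule. It is an illustration only, not how Heartbeat version 1 behaves; the function names `has_quorum` and `partitions_with_quorum` are made up for this example.

```python
def has_quorum(visible_nodes, total_nodes):
    """A partition has quorum only if it holds a strict majority of all nodes."""
    return visible_nodes > total_nodes // 2


def partitions_with_quorum(partition_sizes):
    """For each partition (given by its node count), report whether it has quorum."""
    total = sum(partition_sizes)
    return [has_quorum(size, total) for size in partition_sizes]


if __name__ == "__main__":
    # 2-node cluster split 1/1: neither side has a majority (split brain risk).
    print(partitions_with_quorum([1, 1]))
    # 3-node cluster split 2/1: only the side with two nodes may keep running.
    print(partitions_with_quorum([2, 1]))
```

With two nodes, a split always leaves each side seeing exactly half of the cluster, so neither can claim a majority. With three nodes, any split leaves at most one side with a majority.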

It's not good practice to have a "kill myself" tool, like a last-man button, because you never know what has happened to the other node. Whether the communication failed or the other host just went south, you will see the same behaviour, so you cannot kill yourself in either case. The same goes from the other node's point of view.

I know this is not a solution, but I hope it helps you understand the way the CRM works. If you build a cluster, try to use more than 2 nodes; it's that simple.

Marc Riera