
I have 2 nodes:

- patroni1: 192.168.1.38
- patroni2: 192.168.1.39

and a virtual IP: 192.168.1.40

I have HA-Proxy installed on both.

Here is my pcs status when the VIP is attached to patroni2 and haproxy is running on patroni2:

-----------
[root@patroni1 ~]# pcs status
Cluster name: haproxy_cluster
Stack: corosync
Current DC: patroni2 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Thu Nov 29 21:29:00 2018
Last change: Thu Nov 29 21:24:52 2018 by root via cibadmin on patroni1

2 nodes configured
4 resources configured

Online: [ patroni1 patroni2 ]

Full list of resources:

 xen-fencing-patroni2   (stonith:fence_xenapi): Started patroni1
 xen-fencing-patroni1   (stonith:fence_xenapi): Started patroni2
 Resource Group: HAproxyGroup
     haproxy    (ocf::heartbeat:haproxy):   Started patroni2
     VIP    (ocf::heartbeat:IPaddr2):   Started patroni2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@patroni1 ~]# pcs resource show VIP
 Resource: VIP (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=24 ip=192.168.1.40
  Operations: monitor interval=1s (VIP-monitor-interval-1s)
              start interval=0s timeout=20s (VIP-start-interval-0s)
              stop interval=0s timeout=20s (VIP-stop-interval-0s)
[root@patroni1 ~]# pcs resource show haproxy
 Resource: haproxy (class=ocf provider=heartbeat type=haproxy)
  Attributes: binpath=/usr/sbin/haproxy conffile=/etc/haproxy/haproxy.cfg
  Operations: monitor interval=10s (haproxy-monitor-interval-10s)
              start interval=0s timeout=20s (haproxy-start-interval-0s)
              stop interval=0s timeout=20s (haproxy-stop-interval-0s)

-----------
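
For reference, resources like these are typically created with pcs commands along the following lines (a sketch reconstructed from the output above, not necessarily my exact commands):

-----------
# create the haproxy resource and the VIP in one group, so they move together
pcs resource create haproxy ocf:heartbeat:haproxy \
    binpath=/usr/sbin/haproxy conffile=/etc/haproxy/haproxy.cfg \
    op monitor interval=10s --group HAproxyGroup
pcs resource create VIP ocf:heartbeat:IPaddr2 \
    ip=192.168.1.40 cidr_netmask=24 \
    op monitor interval=1s --group HAproxyGroup
-----------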

My problem is: fencing is not triggered when I manually kill haproxy on patroni2. Fencing is only triggered when I manually halt or reboot patroni2.

Here is the pcs status after I manually kill haproxy:

------------
[root@patroni1 ~]# pcs status
Cluster name: haproxy_cluster
Stack: corosync
Current DC: patroni2 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Thu Nov 29 21:37:37 2018
Last change: Thu Nov 29 21:24:52 2018 by root via cibadmin on patroni1

2 nodes configured
4 resources configured

Online: [ patroni1 patroni2 ]

Full list of resources:

 xen-fencing-patroni2   (stonith:fence_xenapi): Started patroni1
 xen-fencing-patroni1   (stonith:fence_xenapi): Started patroni2
 Resource Group: HAproxyGroup
     haproxy    (ocf::heartbeat:haproxy):   Started patroni2
     VIP    (ocf::heartbeat:IPaddr2):   Starting patroni2

Failed Actions:
* haproxy_monitor_10000 on patroni2 'not running' (7): call=38, status=complete, exitreason='',
    last-rc-change='Thu Nov 29 21:37:36 2018', queued=0ms, exec=0ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

------------

How can I make fencing trigger when HA-Proxy is not responding?

Sincerely -bino-


1 Answer


What you're observing is the expected behavior. Just because a resource is stopped does not mean the best course of action is to forcefully power-cycle the system.

You manually kill HA-Proxy, Pacemaker detects that the service is not running for some reason, and logs the failure: haproxy_monitor_10000 on patroni2 'not running' [...]. The cluster then restarts the service, which I would assume worked, since the cluster now shows the service running without issue on the very same patroni2 node.

A monitor operation failure is not considered fatal, and as such it will not escalate to a STONITH action. A failure on a stop operation, however, is considered fatal. If the cluster can't stop the resource, how can it restart it or fail it over? Only by fencing the node and power-cycling it via STONITH.
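
For completeness: if a monitor failure really should move the group or escalate to fencing, the knobs are the resource's migration-threshold and the monitor operation's on-fail setting. Roughly, as an untested sketch using the resource name from your question (on-fail=fence is aggressive; the default is restart):

-----------
# move the group to the other node after a single haproxy failure
pcs resource meta haproxy migration-threshold=1
# or, far more aggressively, fence the node whenever the haproxy monitor fails
pcs resource update haproxy op monitor interval=10s on-fail=fence
-----------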

Dok
  • Thank you @dok for your explanation. So how do I guarantee that if one service fails, the VIP will move to another node? Am I taking the wrong path to achieve my goal? – Bino Oetomo Dec 01 '18 at 01:56
  • A monitor failure makes the service restart. A start or stop failure of a resource will make the service migrate. To test, simply create a situation where a service fails to start or stop. I am not personally familiar with haproxy, but perhaps give it an invalid configuration to cause a start failure, then kill the process to trigger a monitor failure. Assuming it fails to start with the nonsense config, it should then fail to start on the recovery operation and fail over (a command sketch follows these comments). – Dok Dec 03 '18 at 17:46
  • If your goal is really to fail over if monitoring fails for HAProxy, you can *try* `pcs resource update haproxy op monitor on-fail=fence`. My gut tells me that you're going to get into trouble, that the monitoring will fail sporadically and you'll end up with a longer outage. – Mike Andrews Dec 06 '18 at 20:50
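
A rough way to try what Dok suggests above (a sketch; the binary and config paths come from the haproxy resource attributes in the question, and the .bad filename is just an example):

-----------
# on patroni2: break the config so the next start attempt fails
mv /etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg.bad
# then kill the running process to trigger a monitor failure
pkill -f /usr/sbin/haproxy
# watch recovery: the local restart should fail and HAproxyGroup should move to patroni1
crm_mon -1
-----------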