I have a pair of servers set up as high-availability load balancers / reverse proxies. Each runs Ubuntu 12.04 x64 Server with Varnish, Heartbeat and Pacemaker, and Varnish load-balances traffic to the back-end servers.
If either of the load balancers falls over, Heartbeat/Pacemaker transfers a group of virtual IPs over to the other server, and traffic flow resumes. This bit works fine.
What I haven't accounted for is Varnish itself failing: it's currently possible to stop Varnish on the active server without triggering any sort of action from Heartbeat/Pacemaker. I'd like the absence of an operational Varnish on the active server to trigger a move to the backup (rather than attempting to restart Varnish), but I'm struggling to find any guidance online. Can anyone help?
Edit, following Daff's assistance:
I ended up with something a little different from my original request: Pacemaker attempts to restart Varnish once, and if that fails, it moves all resources to the passive node.
My setup is two servers, serverA (active) and serverB (passive). I'll assume that the messaging layer (Heartbeat or Corosync) is already set up and working. To allow Pacemaker to control Varnish, we need to fix Ubuntu's Varnish init script:
sudo vim /etc/init.d/varnish
Replace:
--start --quiet --pidfile ${PIDFILE} --exec ${DAEMON} -- \
in the start_varnish_d() function with:
--start --quiet --pidfile ${PIDFILE} --oknodo --exec ${DAEMON} -- \
so it works in accordance with the rules outlined here (a rough way to sanity-check the script's exit codes is sketched a little further down). Now set up a basic Pacemaker cluster on serverA with two virtual IPs:
sudo crm configure property no-quorum-policy=ignore
sudo crm configure property stonith-enabled=false
sudo crm configure primitive virtual_ip_1 ocf:heartbeat:IPaddr params ip="192.168.1.134" nic="eth1" cidr_netmask="24" broadcast="192.168.1.255" op monitor interval="10s" timeout="20s"
sudo crm configure primitive virtual_ip_2 ocf:heartbeat:IPaddr params ip="192.168.1.135" nic="eth1" cidr_netmask="24" broadcast="192.168.1.255" op monitor interval="10s" timeout="20s"
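Before adding Varnish itself, it's worth sanity-checking that the edited init script returns LSB-compliant exit codes, since Pacemaker's lsb: resource class relies on them. A rough manual check (I'm assuming the stock Ubuntu script and paths here; the expected codes come from the LSB spec):

sudo /etc/init.d/varnish start;  echo "start: $?"            # expect 0
sudo /etc/init.d/varnish start;  echo "start again: $?"      # expect 0, thanks to --oknodo
sudo /etc/init.d/varnish status; echo "status running: $?"   # expect 0
sudo /etc/init.d/varnish stop;   echo "stop: $?"             # expect 0
sudo /etc/init.d/varnish status; echo "status stopped: $?"   # expect 3
sudo /etc/init.d/varnish start                                # leave Varnish running again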
Add a primitive for Varnish, providing a monitoring frequency and generous timings for starting and stopping Varnish:
sudo crm configure primitive varnish lsb:varnish op monitor interval="10s" timeout="20s" op start interval="0" timeout="15s" op stop interval="0" timeout="15s"
Group the varnish primitive with the virtual IPs, so Pacemaker migrates all resources to the passive node in the event of a failure:
sudo crm configure group cluster virtual_ip_1 virtual_ip_2 varnish
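At this point a one-shot look at the cluster should show both virtual IPs and Varnish started together on serverA; crm_mon and the crm shell are part of the standard Pacemaker tooling:

sudo crm_mon -1          # current cluster status: nodes, resources and where they run
sudo crm configure show  # review the configuration built up so far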
Set migration-threshold so that Pacemaker moves all resources to the passive node once a resource has failed twice. For Varnish, that means one initial failure plus one failed restart attempt:
sudo crm_attribute --type rsc_defaults --attr-name migration-threshold --attr-value 2
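For what it's worth, the same default can also be set from the crm shell instead of crm_attribute, which keeps it visible alongside the rest of the configuration; as far as I can tell the equivalent is:

sudo crm configure rsc_defaults migration-threshold=2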
Set a failure timeout. This seems to do two things:
Gives Pacemaker the time for one Varnish restart attempt before migrating to the passive node.
Clears the failure record after 30s, so the node is no longer treated as failed and resources can be moved back to it without manually having to run crm resource cleanup varnish after a failure. This is a good thing for my setup, as I don't have any weightings set on the nodes, but it could be a really bad idea in a different environment.
sudo crm_attribute --type rsc_defaults --attr-name failure-timeout --attr-value 30s
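You can keep an eye on the fail counts while testing this; crm_mon's -f flag displays them alongside the resource status:

sudo crm_mon -1 -f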
And that's it, but see Daff's answer below for comments about stickiness, which I didn't end up using. The only downside I can see is that if you manually put a node into standby, Pacemaker will shut down Varnish on that node, clearing its in-memory cache. For me, that isn't a particularly big deal.
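A rough way to exercise the failover path (a sketch rather than a transcript of my exact test; killall varnishd would simulate a crash just as well):

# on serverA (the active node)
sudo service varnish stop   # first failure: Pacemaker should restart Varnish in place
# wait for the monitor to notice and restart Varnish, then fail it again
sudo service varnish stop   # second failure before the first expires: the group migrates

# on serverB
sudo crm_mon -1 -f          # the virtual IPs and Varnish should now be running here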