Can Heartbeat notice a stopped service and restart it?

Question

I have two nodes, and complete IP failover works very well for them: when node 1 goes down, node 2 grabs the IP and starts the services.

What I would love to know is:

If server 1 does not crash completely, but only one of the services stops unexpectedly, can Heartbeat be configured to watch for that and start it again?

EDIT: Is it possible with Pacemaker?

A quote from http://clusterlabs.org/wiki/FAQ#Organizational:

Pacemaker keeps your applications running when they or the machines they're running on fail

You might be better off using something like monit. – ceejayoz, Mar 11 '13 at 16:30

score 0 · Accepted Answer · answered Mar 11 '13 at 17:10

If server 1 does not crash completely, but only one of the services stops unexpectedly, can Heartbeat be configured to watch for that and start it again?

Sure, Heartbeat version 2 can do it.

With version 1, your /etc/ha.d/haresources looks like this:

master              129.79.136.4 apache

Then you can generate the Heartbeat 2 configuration file by running:

python /usr/lib64/heartbeat/haresources2cib.py > /var/lib/heartbeat/crm/cib.xml

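The resulting /var/lib/heartbeat/crm/cib.xml defines a monitor operation for each resource, which is what lets Heartbeat notice a stopped service: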

...
    <resources>
        <group id="group_1">
            <primitive class="ocf" id="IPaddr_129_79_136_4" provider="heartbeat" type="IPaddr">
                <operations>
                    <op id="IPaddr_129_79_136_4_mon" interval="5s" name="monitor" timeout="5s"/>
                </operations>
                <instance_attributes id="IPaddr_129_79_136_4_inst_attr">
                    <attributes>
                        <nvpair id="IPaddr_129_79_136_4_attr_0" name="ip" value="129.79.136.4"/>
                    </attributes>
                </instance_attributes>
            </primitive>
            <primitive class="ocf" id="apache_2" provider="heartbeat" type="apache">
                <operations>
                    <op id="apache_2_mon" interval="120s" name="monitor" timeout="60s"/>
                </operations>
            </primitive>
        </group>
    </resources>
...

That said, I would suggest going with Corosync and Pacemaker instead.

score 0 · Answer 2 · edited Apr 13 '17 at 12:14

Do not use Heartbeat (v2). At all. Migrate away from it at once and go with Pacemaker as the cluster resource manager and Corosync as the messaging layer. You will not even find support for Heartbeat v2 in any of the various Linux HA community channels and in fact you might get yelled at for using obsolete technology. Perfectly good successors to Heartbeat exist in Pacemaker and Corosync.

Pacemaker is the cluster resource manager for the Linux HA stack and is designed exactly for what you are asking. It monitors resources (IP addresses, services, file systems, mount points, routes, ...) and can and will try to restart them should they fail. Of course it also does much more than that.
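As a rough sketch of what that looks like in a Pacemaker CIB, here is a single Apache resource with a recurring monitor operation. The resource id, the interval and timeout values, and the configfile path are illustrative assumptions, not values from the question, so adjust them to your own setup:

```xml
<!-- Sketch only: ids, timings and the configfile path are illustrative assumptions -->
<primitive id="p_apache" class="ocf" provider="heartbeat" type="apache">
  <instance_attributes id="p_apache-params">
    <nvpair id="p_apache-configfile" name="configfile" value="/etc/httpd/conf/httpd.conf"/>
  </instance_attributes>
  <operations>
    <!-- The recurring monitor operation is what lets the cluster notice a stopped service.
         on-fail="restart" asks Pacemaker to restart the resource when the check fails. -->
    <op id="p_apache-monitor-30s" name="monitor" interval="30s" timeout="20s" on-fail="restart"/>
  </operations>
</primitive>
```

Without a monitor operation with a non-zero interval, the cluster only learns the state of a resource when it starts or stops it, so a service that dies on its own would go unnoticed.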

I will link one of my previous answers here because there's no real point in repeating myself further: Heartbeat won't successfully start up resources from a cold boot when a failed node is present
