
I have a setup with two machines running Pacemaker and Corosync, with a PostgreSQL master/slave set on top. The master node holds a resource group with a virtual IP and two additional services that are supposed to run alongside the database master. When a failover is triggered by killing the database master, all services in the group migrate to the other node, which is exactly what I expect and want.

The additional services, however, are just marked as failed when I kill them, and that's it. Since I only want a migration to happen when the database fails, that part is actually fine. However, I do want Pacemaker to restart these services when they fail, not just mark them as failed.

My expectation was that it would do exactly that once I added on-fail=restart to the monitor op of these services, but that is not the case.
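For reference, the option was added with pcs commands roughly like the ones below (typed from memory, so the timeouts may not exactly match the dump further down):

    # replace the monitor op so a failed monitor is supposed to trigger a restart
    pcs resource update additional-resource1 op monitor interval=60s timeout=20s on-fail=restart
    pcs resource update additional-resource2 op monitor interval=60s on-fail=restart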

The group, together with the rest of the relevant pcs config, looks like this:

 Group: master-group
  Resource: VirtualIP (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=1.2.3.4 cidr_netmask=24 nic=ens1f0
   Operations: start interval=0s timeout=20s (VirtualIP-start-interval-0s)
               stop interval=0s timeout=20s (VirtualIP-stop-interval-0s)
               monitor interval=30s (VirtualIP-monitor-interval-30s)
  Resource: additional-resource1 (class=ocf provider=heartbeat type=additional-resource1)
   Operations: stop interval=0s timeout=20s (additional-resource1-stop-interval-0s)
               monitor interval=60s timeout=20s (additional-resource1-monitor-interval-60s)
               start interval=0s on-fail=restart timeout=20s (additional-resource1-start-interval-0s)
  Resource: additional-resource2 (class=lsb type=additional-resource2)
   Operations: start interval=10s on-fail=restart timeout=60s (additional-resource2-start-interval-10s)
               stop interval=0s timeout=20s (additional-resource2-stop-interval-0s)
               monitor interval=60s on-fail=restart timeout=0s (additional-resource2-monitor-interval-60s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote msPostgresql then start master-group (score:INFINITY) (non-symmetrical) (id:order-msPostgresql-master-group-INFINITY)
  demote msPostgresql then stop master-group (score:0) (non-symmetrical) (id:order-msPostgresql-master-group-0)
Colocation Constraints:
  master-group with msPostgresql (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-master-group-msPostgresql-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 resource-stickiness: INFINITY
 migration-threshold: 1
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 dc-version: 1.1.15-11.el7_3.4-e174ec8
 have-watchdog: false
 last-lrm-refresh: 1498820659
 no-quorum-policy: ignore
 stonith-enabled: false
Node Attributes:
 node1: pgsql-data-status=LATEST
 node2: pgsql-data-status=STREAMING|SYNC

Can anyone explain how to achieve this?

  • Can you share the type of failures you're seeing? Monitor operation failures, start operation failures, and stop operation failures are treated slightly differently depending on what's configured. Also, the INFINITY score for default stickiness is interesting; resource stickiness at 1000 is probably more than enough while being less definite. – Matt Kereczman Jul 01 '17 at 16:19
  • Calling "pcs status" or "crm_mon -Afr -1" actually doesn't show any errors at all. It just lists the services as Stopped once I kill them and does not attempt to restart them. – juwi Jul 03 '17 at 09:33
  • I actually just figured out that the monitor timeout might have been the issue. It now works with a timeout of 10s; see the sketch below. – juwi Jul 03 '17 at 12:20
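For completeness, the fix from the last comment should look roughly like this (untested against this exact configuration; adjust the interval to whatever your resource uses):

    # give the monitor op an explicit, non-zero timeout and keep on-fail=restart
    pcs resource update additional-resource2 op monitor interval=60s timeout=10s on-fail=restart
    # clear the old failure records so Pacemaker re-evaluates the resource
    pcs resource cleanup additional-resource2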

0 Answers