
I have created an active/passive cluster using Pacemaker / Corosync / DRBD and have "simulated" an Apache failure with `pkill httpd`. Although Pacemaker recovered from the "failure" and restarted httpd, when I now execute `pcs status` I get:

Failed Actions:
* apache_monitor_60000 on server1 'not running' (7): call=39, status=complete, exitreason='none',
    last-rc-change='Wed May  9 09:55:45 2018', queued=0ms, exec=0ms

Why does Pacemaker not clear the failed action after a successful recovery? Or is there any other way to clear the failed action other than manually?

postFix

2 Answers


That is by design. Some admins, myself included, like to see the error so that we know when it occurred and can investigate. Additionally, pacemaker needs to track these errors so that it can decide where best to start a resource.
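You can inspect the failure count Pacemaker keeps for these decisions. A quick way to do that with the `pcs` tooling used in the question (exact subcommand syntax may vary slightly between pcs versions, so check `pcs resource help` on your system):

```shell
# Show the stored failure count for the apache resource
# ("apache" is the resource name from the question)
pcs resource failcount show apache
```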

Pacemaker does, though, have a method to clear failures after a specified time if no new failures have occurred. This is known as the failure-timeout. It can be configured per resource, but below is how you would specify it as a cluster-wide resource default with the crm shell. I would expect pcs also has a method to define it.

crm configure rsc_defaults failure-timeout=15m
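Since the question uses pcs, the same cluster-wide default can be set along these lines (syntax as of pcs 0.9.x; newer releases may use a `pcs resource defaults update` form instead):

```shell
# Set failure-timeout as a resource default for the whole cluster
pcs resource defaults failure-timeout=15m
```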

Please note that this is only checked at the cluster-recheck-interval, which by default is every 15 minutes. With a failure-timeout of 15m set, depending on exactly when the failure occurred, it can therefore take up to 29 minutes 59 seconds for the failure to clear.
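If that worst-case delay is too long for you, the recheck interval is itself a cluster property and can be shortened, for example (property name per the "Configuration Explained" documentation; the 5-minute value here is just an illustration):

```shell
# With the crm shell:
crm configure property cluster-recheck-interval=5m

# Or with pcs:
pcs property set cluster-recheck-interval=5min
```

Keep in mind that a shorter recheck interval means the policy engine runs more often cluster-wide, so don't set it unreasonably low.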

Dok
  • Thank you for your comment. The error-tracking idea makes sense. Once I did a test and modified a configuration file (in such a way that you wouldn't be able to start the service even with `systemctl`) for a service that was managed by pacemaker, and I observed that after the monitor interval elapsed the service and all other resources (all part of a resource group) were moved to the other node. Not sure what you meant with 5 "non-fatal" errors. Can you point me to any resource mentioning it? – postFix May 09 '18 at 19:02
  • 1
    Start and Stop errors are considered "fatal" by default and will immediately set the failcount to infinity. Monitor failures are non-fatal, which will increment the failcount by one and start a recovery. I was mistaken about the 5 failure threshold. I was referring to the migration-threshold value, which is not 5, but rather infinity by default. I have edited my above post to correct this error. This page of the "Configuration Explained" does a good job describing all of this: https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_failure_response.html – Dok May 09 '18 at 19:38
  • Thank you for clarifying. I'm aware of the documentation; however, what I found disturbing is the fact that the Clusters from Scratch documentation covers `pcs` but Pacemaker Explained uses `crm`. I haven't found an ultimate crm-to-pcs reference guide. I have this one [https://github.com/ClusterLabs/pacemaker/blob/master/doc/pcs-crmsh-quick-ref.md#move-resources] but it does not cover all commands. Do you have any better documentation? – postFix May 10 '18 at 07:24

You can clear the error state manually, too ("cleanup"):

crm_resource -C -r apache -N server1 -n monitor

Here you specify the resource name, the node, and the operation.
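With the pcs tooling used in the question, the equivalent cleanup would look roughly like this (the `--node` option limits the cleanup to one node; check `pcs resource help` for your version's exact syntax):

```shell
# Clear the recorded failure for the apache resource on server1 only
pcs resource cleanup apache --node server1
```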

In case there is a local problem on the node, the error state prevents the cluster from repeatedly trying and failing the operation on the bad node. When doing manual tests, a manual cleanup is more natural than the automatic cleanup requested in the question.

It's a good habit to check the cluster for errors from time to time. And of course trying to fix those will establish a "good cluster".

U. Windl