[ClusterLabs] Coming in Pacemaker 1.1.17: Per-operation fail counts

Mon Apr 3 11:00:38 EDT 2017

Hi all,

Pacemaker 1.1.17 will have a significant change in how it tracks
resource failures, though the change will be mostly invisible to users.

Previously, Pacemaker tracked a single count of failures per resource --
for example, start failures and monitor failures for a given resource
were added together.

In a thread on this list last year[1], we discussed adding some new
failure handling options that would require tracking failures for each
operation type.

Pacemaker 1.1.17 will include this tracking, in preparation for adding
the new options in a future release.

Previously, failure counts were stored in a single node attribute per
resource, such as "fail-count-myrsc". They will now be stored in multiple
node attributes, such as "fail-count-myrsc#start_0" and
"fail-count-myrsc#monitor_10000". The number is the operation's interval
in milliseconds, which distinguishes monitors with different intervals.
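To show how the new naming scheme can be read back, here is a minimal Python sketch that splits an attribute name into resource, operation, and interval. The function and type names are hypothetical (they are not part of Pacemaker). The sketch assumes the "fail-count-<resource>#<operation>_<interval>" form shown above, with the interval in milliseconds.

```python
from typing import NamedTuple, Optional

PREFIX = "fail-count-"


class FailCountKey(NamedTuple):
    resource: str
    operation: Optional[str]    # None for an old-style, resource-wide counter
    interval_ms: Optional[int]  # None for an old-style, resource-wide counter


def parse_fail_count_attr(name: str) -> Optional[FailCountKey]:
    """Split a fail-count node attribute name into its parts (hypothetical helper).

    Returns None for attributes that are not fail counts.
    """
    if not name.startswith(PREFIX):
        return None
    rest = name[len(PREFIX):]
    rsc, sep, op_key = rest.rpartition("#")
    if not sep:
        # Old style, e.g. "fail-count-myrsc": one counter for the whole resource
        return FailCountKey(rest, None, None)
    # Split on the LAST underscore so operation names that themselves
    # contain underscores (e.g. "migrate_to") stay intact.
    op, _, interval = op_key.rpartition("_")
    if not op or not interval.isdigit():
        return None  # not in the assumed "<op>_<interval>" form
    return FailCountKey(rsc, op, int(interval))
```

For example, parse_fail_count_attr("fail-count-myrsc#monitor_10000") returns FailCountKey(resource='myrsc', operation='monitor', interval_ms=10000). The same call on "fail-count-myrsc" returns the old-style key, with no operation and no interval.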

Actual cluster behavior will be unchanged in this release (and
backward-compatible); the cluster will sum the per-operation fail counts
when checking against options such as migration-threshold.
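To illustrate the summing, here is a minimal Python sketch. It assumes a node's attributes are available as a mapping from name to integer count. The helper names are hypothetical. The sketch ignores special values such as INFINITY fail counts and only shows the idea: the per-operation counts (plus any old-style counter) are added up, and the total is compared with migration-threshold.

```python
from typing import Mapping


def total_fail_count(attrs: Mapping[str, int], rsc: str) -> int:
    """Sum every fail count belonging to rsc on one node, across all operations."""
    old_style = f"fail-count-{rsc}"       # pre-1.1.17 single counter
    per_op_prefix = f"fail-count-{rsc}#"  # 1.1.17 per-operation counters
    return sum(
        count
        for name, count in attrs.items()
        if name == old_style or name.startswith(per_op_prefix)
    )


def reached_threshold(attrs: Mapping[str, int], rsc: str,
                      migration_threshold: int) -> bool:
    """True once the summed fail count reaches a (positive) migration-threshold."""
    return total_fail_count(attrs, rsc) >= migration_threshold


node_attrs = {
    "fail-count-myrsc#start_0": 1,
    "fail-count-myrsc#monitor_10000": 1,
    "fail-count-other#monitor_5000": 3,  # a different resource; not counted
}
print(total_fail_count(node_attrs, "myrsc"))      # 2
print(reached_threshold(node_attrs, "myrsc", 2))  # True
```

Because the "#" separator follows the resource name directly, counters for a resource named "myrsc2" are not mistakenly included in the total for "myrsc".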

The part that will be visible to the user in this release is that the
crm_failcount and crm_resource --cleanup tools will now be able to
handle individual per-operation fail counts if desired, though by
default they will still affect the total fail count for the resource.

As an example, if "myrsc" has one start failure and one monitor failure,
"crm_failcount -r myrsc --query" will still show 2, but now you can also
say "crm_failcount -r myrsc --query --operation start" which will show 1.
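The query behavior in that example can be modeled in a few lines of Python. This is a hypothetical model of the semantics, not the crm_failcount implementation. It assumes the per-operation attribute naming described earlier and looks only at new-style attributes. By default it totals all operations. When given an operation name, it totals only that operation, across all of its intervals.

```python
from typing import Mapping, Optional


def query_fail_count(attrs: Mapping[str, int], rsc: str,
                     operation: Optional[str] = None) -> int:
    """Total fail count for rsc, optionally restricted to one operation name."""
    prefix = f"fail-count-{rsc}#"
    total = 0
    for name, count in attrs.items():
        if not name.startswith(prefix):
            continue
        op_key = name[len(prefix):]          # e.g. "monitor_10000"
        op_name = op_key.rpartition("_")[0]  # e.g. "monitor"
        if operation is None or op_name == operation:
            total += count
    return total


attrs = {"fail-count-myrsc#start_0": 1, "fail-count-myrsc#monitor_10000": 1}
print(query_fail_count(attrs, "myrsc"))           # 2, like the plain --query
print(query_fail_count(attrs, "myrsc", "start"))  # 1, like --operation start
```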

Additionally, crm_failcount --delete previously only reset the
resource's fail count, but it now behaves identically to crm_resource
--cleanup (resetting the fail count and clearing the failure history).

Special note for pgsql users: Older versions of common pgsql resource
agents relied on a behavior of crm_failcount that is now rejected. While
the impact is limited, users are advised to make sure they have the
latest version of their pgsql resource agent before upgrading to
Pacemaker 1.1.17.

[1] http://lists.clusterlabs.org/pipermail/users/2016-September/004096.html
-- 
Ken Gaillot <kgaillot at redhat.com>