[ClusterLabs] Re: Coming in Pacemaker 1.1.17: Per-operation fail counts

Tue Apr 4 02:18:55 EDT 2017

>>> Ken Gaillot <kgaillot at redhat.com> wrote on 03.04.2017 at 17:00 in message
<ae3a7cf4-2ef7-4c4f-ae3f-39f473ed6c01 at redhat.com>:
> Hi all,
> 
> Pacemaker 1.1.17 will have a significant change in how it tracks
> resource failures, though the change will be mostly invisible to users.
> 
> Previously, Pacemaker tracked a single count of failures per resource --
> for example, start failures and monitor failures for a given resource
> were added together.

That is "per resource operation", not "per resource" ;-)

> 
> In a thread on this list last year[1], we discussed adding some new
> failure handling options that would require tracking failures for each
> operation type.

So the existing set of operation failures was restricted to start/stop/monitor? What about master/slave resources, which have two monitor operations?

> 
> Pacemaker 1.1.17 will include this tracking, in preparation for adding
> the new options in a future release.
> 
> Whereas previously, failure counts were stored in node attributes like
> "fail-count-myrsc", they will now be stored in multiple node attributes
> like "fail-count-myrsc#start_0" and "fail-count-myrsc#monitor_10000"
> (the number distinguishes monitors with different intervals).
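
To make the naming scheme concrete, here is a minimal sketch that splits such an attribute name into its parts. It assumes only the layout quoted above (fail-count-<resource>#<operation>_<interval>); the function name parse_fail_count_attr is hypothetical and not part of Pacemaker.

```python
def parse_fail_count_attr(name):
    """Split a fail-count node attribute name into its parts.

    Assumed layout (taken from the examples in the announcement):
        fail-count-<resource>#<operation>_<interval>
    Returns (resource, operation, interval). For the older per-resource
    form "fail-count-<resource>", operation and interval are None.
    """
    prefix = "fail-count-"
    if not name.startswith(prefix):
        raise ValueError("not a fail-count attribute: %r" % name)
    rest = name[len(prefix):]
    if "#" not in rest:
        # Old style: one counter per resource, no operation suffix.
        return rest, None, None
    # Split at the last '#', then at the last '_' before the interval.
    resource, op_key = rest.rsplit("#", 1)
    operation, interval = op_key.rsplit("_", 1)
    return resource, operation, int(interval)


print(parse_fail_count_attr("fail-count-myrsc#start_0"))
print(parse_fail_count_attr("fail-count-myrsc#monitor_10000"))
```

Splitting from the right keeps the sketch working when an operation name itself contains an underscore.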

Wouldn't it be conceivable to store it as a (transient) resource attribute, either local to a node (LRM) or including the node attribute (CRM)?

> 
> Actual cluster behavior will be unchanged in this release (and
> backward-compatible); the cluster will sum the per-operation fail counts
> when checking against options such as migration-threshold.
> 
> The part that will be visible to the user in this release is that the
> crm_failcount and crm_resource --cleanup tools will now be able to
> handle individual per-operation fail counts if desired, though by
> default they will still affect the total fail count for the resource.

Another thing to think about would be "fail count" vs. "fail rate": Currently there is a fail count and a reset interval, which together allow deriving a failure rate. Maybe many users just require that a resource shouldn't fail again and again; with long uptimes (where the operator forgets to reset the fail counters), occasional failures (like once in two weeks) shouldn't prevent a resource from running.
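
As an illustration of the "fail rate" idea, here is a minimal sketch of a sliding-window counter. It is purely hypothetical and does not describe how Pacemaker works; the class name FailureWindow and the choice of a one-day window are assumptions made for the example.

```python
from collections import deque


class FailureWindow:
    """Count only the failures that happened within a recent time window.

    Hypothetical helper; timestamps are plain seconds and are assumed to
    be recorded in non-decreasing order.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._times = deque()

    def record(self, timestamp):
        """Remember one failure at the given time."""
        self._times.append(timestamp)

    def count(self, now):
        """Return the number of failures newer than now - window_seconds."""
        cutoff = now - self.window_seconds
        # Drop failures that have aged out of the window.
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times)


DAY = 86400
window = FailureWindow(DAY)
window.record(0)          # one failure ...
window.record(14 * DAY)   # ... and another one two weeks later
print(window.count(14 * DAY))
```

With a one-day window, a failure every two weeks never counts as more than one, while several failures within the same day still add up.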

> 
> As an example, if "myrsc" has one start failure and one monitor failure,
> "crm_failcount -r myrsc --query" will still show 2, but now you can also
> say "crm_failcount -r myrsc --query --operation start" which will show 1.
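
The example above can be sketched as a small model of the bookkeeping: per-operation counters are summed for the resource total, or filtered by operation name. This is only an illustration of the described behavior, not the crm_failcount implementation; the function fail_count and the attrs dictionary are hypothetical.

```python
def fail_count(attrs, resource, operation=None):
    """Sum per-operation fail counts for a resource.

    attrs maps node attribute names (for example
    "fail-count-myrsc#start_0") to integer counts. With operation=None
    every operation is included; otherwise only the named one.
    """
    prefix = "fail-count-%s#" % resource
    total = 0
    for name, value in attrs.items():
        if not name.startswith(prefix):
            continue
        # "monitor_10000" -> "monitor": strip the trailing interval.
        op_name = name[len(prefix):].rsplit("_", 1)[0]
        if operation is None or op_name == operation:
            total += value
    return total


# One start failure and one monitor failure, as in the example.
attrs = {
    "fail-count-myrsc#start_0": 1,
    "fail-count-myrsc#monitor_10000": 1,
}
print(fail_count(attrs, "myrsc"))            # 2
print(fail_count(attrs, "myrsc", "start"))   # 1
```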

Would accumulated monitor failures ever prevent a resource from starting, or will they force a stop of the resource?

Regards,
Ulrich

> 
> Additionally, crm_failcount --delete previously only reset the
> resource's fail count, but it now behaves identically to crm_resource
> --cleanup (resetting the fail count and clearing the failure history).
> 
> Special note for pgsql users: Older versions of common pgsql resource
> agents relied on a behavior of crm_failcount that is now rejected. While
> the impact is limited, users are advised to make sure they have the
> latest version of their pgsql resource agent before upgrading to
> Pacemaker 1.1.17.
> 
> [1] http://lists.clusterlabs.org/pipermail/users/2016-September/004096.html 
> -- 
> Ken Gaillot <kgaillot at redhat.com>
> 