[ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

Thu Sep 22 20:51:57 CEST 2016

On 09/22/2016 12:58 PM, Kristoffer Grönlund wrote:
> Ken Gaillot <kgaillot at redhat.com> writes:
>>
>> "restart" is the only on-fail value that it makes sense to escalate.
>>
>> block/stop/fence/standby are final. Block means "don't touch the
>> resource again", so there can't be any further response to failures.
>> Stop/fence/standby move the resource off the local node, so failure
>> handling is reset (there are 0 failures on the new node to begin with).
> 
> Hrm. If a restart potentially migrates the resource to a different node,
> is the failcount reset then as well? If so, wouldn't that complicate the
> hard-fail-threshold variable too, since potentially, the resource could
> keep migrating between nodes and since the failcount is reset on each
> migration, it would never reach the hard-fail-threshold. (or am I
> missing something?)

The failure count is specific to each node. By "failure handling is
reset" I mean that when the resource moves to a different node, the
failure count of the original node no longer matters -- the new node's
failure count is now what matters.

A node's failure count is reset only when the user manually clears it,
or the node is rebooted. Also, resources may have a failure-timeout
configured, in which case the count will go down as failures expire.
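
As a rough illustration of the reset and expiry rules above, here is a toy
model in Python. It is only a sketch of the behaviour as described in this
message, not Pacemaker code: the class NodeFailures and its methods are
invented for the example, and time is just a number of seconds.

```python
class NodeFailures:
    """Toy model of one resource's failures on one node (hypothetical names)."""

    def __init__(self, failure_timeout=None):
        # Seconds after which a recorded failure expires; None means never.
        self.failure_timeout = failure_timeout
        self.timestamps = []

    def record_failure(self, now):
        # Remember when the failure happened so it can expire later.
        self.timestamps.append(now)

    def clear(self):
        # Models a manual cleanup by the user, or the node being rebooted.
        self.timestamps = []

    def count(self, now):
        # Drop failures older than failure_timeout, then count what is left.
        if self.failure_timeout is not None:
            self.timestamps = [
                t for t in self.timestamps if now - t < self.failure_timeout
            ]
        return len(self.timestamps)
```

With failure_timeout=60 and failures recorded at t=0 and t=30, the count is
2 at t=40, 1 at t=70, and 0 at t=100. Without a failure-timeout, only clear()
brings it back down.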

So, a resource with on-fail=restart would never go back to a node where
it had previously reached the threshold, unless that node's failure count
were cleared manually, by a reboot, or through failure-timeout expiry.