[ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.
Ken Gaillot
kgaillot at redhat.com
Wed Sep 21 16:17:38 EDT 2016
On 09/20/2016 07:51 PM, Andrew Beekhof wrote:
>
>
> On Wed, Sep 21, 2016 at 6:25 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>
> > Hi everybody,
> >
> > Currently, Pacemaker's on-fail property allows you to configure how the
> > cluster reacts to operation failures. The default, "restart", means try to
> > restart on the same node, optionally moving to another node once
> > migration-threshold is reached. Other possibilities are "ignore",
> > "block", "stop", "fence", and "standby".
> >
> > Occasionally, we get requests to have something like migration-threshold
> > for values besides restart. For example, try restarting the resource on
> > the same node 3 times, then fence.
> >
> > I'd like to get your feedback on two alternative approaches we're
> > considering.
> >
> > ###
> >
> > Our first proposed approach would add a new hard-fail-threshold
> > operation property. If specified, the cluster would first try restarting
> > the resource on the same node,
>
>
> Well, just as now, it would be _allowed_ to start on the same node, but
> this is not guaranteed.
>
>
> > before doing the on-fail handling.
>
> > For example, you could configure a promote operation with
> > hard-fail-threshold=3 and on-fail=fence, to fence the node after 3
> > failures.
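Written out, that example might look like this (hypothetical syntax;
hard-fail-threshold does not exist in any release):

   <!-- hypothetical: tolerate 2 promote failures, fence on the 3rd -->
   <op id="rsc1-promote" name="promote" interval="0"
       on-fail="fence" hard-fail-threshold="3"/>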
>
>
> > One point that's not settled is whether failures of *any* operation
> > would count toward the 3 failures (which is how migration-threshold
> > works now), or only failures of the specified operation.
>
>
> I think if hard-fail-threshold is per-op, then only failures of that
> operation should count.
>
>
>
> > Currently, if a start fails (but is retried successfully), then a
> > promote fails (but is retried successfully), then a monitor fails, the
> > resource will move to another node if migration-threshold=3. We could
> > keep that behavior with hard-fail-threshold, or only count monitor
> > failures toward monitor's hard-fail-threshold. Each alternative has
> > advantages and disadvantages.
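Today that aggregate counter is stored as a single transient node attribute
per resource in the CIB status section, along these lines (sketch; ids
abbreviated):

   <transient_attributes id="node1">
     <instance_attributes id="status-node1">
       <!-- one combined count covering all of rsc1's failed operations -->
       <nvpair id="status-node1-fc" name="fail-count-rsc1" value="3"/>
       <!-- epoch timestamp of the most recent failure -->
       <nvpair id="status-node1-lf" name="last-failure-rsc1" value="1474485179"/>
     </instance_attributes>
   </transient_attributes>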
> >
> > ###
> >
> > The second proposed approach would add a new on-restart-fail resource
> > property.
> >
> > Same as now, on-fail handling for anything but restart would happen
> > immediately after the first failure. A new value, "ban", would
> > immediately move the resource to another node. (on-fail=ban would behave
> > like on-fail=restart with migration-threshold=1.)
> >
> > When on-fail=restart, and restarting on the same node doesn't work, the
> > cluster would do the on-restart-fail handling. on-restart-fail would
> > allow the same values as on-fail (minus "restart"), and would default to
> > "ban".
>
>
> I do wish you well tracking "is this a restart" across demote -> stop ->
> start -> promote in 4 different transitions :-)
>
>
>
> > So, if you want to fence immediately after any promote failure, you
> > would still configure on-fail=fence; if you want to try restarting a few
> > times first, you would configure on-fail=restart and
> > on-restart-fail=fence.
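In op terms (hypothetical syntax; the two ops are alternatives, not meant
to coexist):

   <!-- fence on the first promote failure: -->
   <op id="ms1-promote" name="promote" interval="0" on-fail="fence"/>
   <!-- or retry in place first, then fence, with on-restart-fail=fence
        set as a meta-attribute on the resource: -->
   <op id="ms1-promote" name="promote" interval="0" on-fail="restart"/>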
> >
> > This approach keeps the current threshold behavior -- failures of any
> > operation count toward the threshold. We'd rename migration-threshold to
> > something like hard-fail-threshold, since it would apply to more than
> > just migration, but unlike the first approach, it would stay a resource
> > property.
> >
> > ###
> >
> > Comparing the two approaches, the first is more flexible, but also more
> > complex and potentially confusing.
>
>
> More complex to implement or more complex to configure?
I was thinking more complex in behavior, so perhaps harder to follow and
predict.
For example, "After two start failures, fence this node; after three
promote failures, put the node in standby; but if a monitor failure is
the third operation failure of any type, then move the resource to
another node."
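Spelled out in the proposed per-op syntax, that would be something like
(hypothetical, mirroring the prose above):

   <operations>
     <op id="rsc1-start" name="start" interval="0"
         on-fail="fence" hard-fail-threshold="2"/>
     <op id="rsc1-promote" name="promote" interval="0"
         on-fail="standby" hard-fail-threshold="3"/>
     <!-- monitor falls back to on-fail=restart with migration-threshold=3 -->
     <op id="rsc1-mon" name="monitor" interval="10s" on-fail="restart"/>
   </operations>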
Granted, someone would have to inflict that on themselves :) but another
sysadmin / support tech / etc. who had to deal with the config later
might have trouble following it.
To keep the current default behavior, the default would be complicated,
too: "1 for start and stop operations, and 0 for other operations" where
"0 is equivalent to 1 except when on-fail=restart, in which case
migration-threshold will be used instead".
And then add to that tracking fail-count per node+resource+operation
combination, with the associated status output and cleanup options.
"crm_mon -f" currently shows failures like:
* Node node1:
   rsc1: migration-threshold=3 fail-count=1 last-failure='Wed Sep 21 15:12:59 2016'
What should that look like with per-op thresholds and fail-counts?
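One possible rendering, just to make the question concrete (purely
speculative output, reusing the example thresholds from above):

   * Node node1:
      rsc1: fail-count=1 last-failure='Wed Sep 21 15:12:59 2016'
        start: hard-fail-threshold=2 fail-count=0
        promote: hard-fail-threshold=3 fail-count=1
        monitor: migration-threshold=3 fail-count=0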
I'm not saying it's a bad idea, just that it's more complicated than it
first sounds, so it's worth thinking through the implications.
> > With either approach, we would deprecate the start-failure-is-fatal
> > cluster property. start-failure-is-fatal=true would be equivalent to
> > hard-fail-threshold=1 with the first approach, and on-fail=ban with the
> > second approach. This would be both simpler and more useful -- it allows
> > the value to be set differently per resource.
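For reference, the cluster-wide knob being deprecated is real, current
syntax and lives in crm_config:

   <cluster_property_set id="cib-bootstrap-options">
     <!-- true (the default): a single failed start immediately bans the
          resource from that node -->
     <nvpair id="cib-bootstrap-options-sfif"
             name="start-failure-is-fatal" value="true"/>
   </cluster_property_set>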
> > --
> > Ken Gaillot <kgaillot at redhat.com>