[ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

Wed Sep 21 05:06:34 EDT 2016

On 09/20/2016 10:25 PM, Ken Gaillot wrote:
> Hi everybody,
>
> Currently, Pacemaker's on-fail property allows you to configure how the
> cluster reacts to operation failures. The default "restart" means try to
> restart on the same node, optionally moving to another node once
> migration-threshold is reached. Other possibilities are "ignore",
> "block", "stop", "fence", and "standby".
>
> Occasionally, we get requests to have something like migration-threshold
> for values besides restart. For example, try restarting the resource on
> the same node 3 times, then fence.
>
> I'd like to get your feedback on two alternative approaches we're
> considering.
>
> ###
>
> Our first proposed approach would add a new hard-fail-threshold
> operation property. If specified, the cluster would first try restarting
> the resource on the same node, before doing the on-fail handling.
>
> For example, you could configure a promote operation with
> hard-fail-threshold=3 and on-fail=fence, to fence the node after 3 failures.
>
> One point that's not settled is whether failures of *any* operation
> would count toward the 3 failures (which is how migration-threshold
> works now), or only failures of the specified operation.
>
> Currently, if a start fails (but is retried successfully), then a
> promote fails (but is retried successfully), then a monitor fails, the
> resource will move to another node if migration-threshold=3. We could
> keep that behavior with hard-fail-threshold, or only count monitor
> failures toward monitor's hard-fail-threshold. Each alternative has
> advantages and disadvantages.
having something like reset-failcount-on-success might
be interesting here as well (as operation property, per resource
or both)
>
> ###
>
> The second proposed approach would add a new on-restart-fail resource
> property.
>
> Same as now, on-fail set to anything but restart would be done
> immediately after the first failure. A new value, "ban", would
> immediately move the resource to another node. (on-fail=ban would behave
> like on-fail=restart with migration-threshold=1.)
>
> When on-fail=restart, and restarting on the same node doesn't work, the
> cluster would do the on-restart-fail handling. on-restart-fail would
> allow the same values as on-fail (minus "restart"), and would default to
> "ban".
>
> So, if you want to fence immediately after any promote failure, you
> would still configure on-fail=fence; if you want to try restarting a few
> times first, you would configure on-fail=restart and on-restart-fail=fence.
>
> This approach keeps the current threshold behavior -- failures of any
> operation count toward the threshold. We'd rename migration-threshold to
> something like hard-fail-threshold, since it would apply to more than
> just migration, but unlike the first approach, it would stay a resource
> property.
>
> ###
>
> Comparing the two approaches, the first is more flexible, but also more
> complex and potentially confusing.
>
> With either approach, we would deprecate the start-failure-is-fatal
> cluster property. start-failure-is-fatal=true would be equivalent to
> hard-fail-threshold=1 with the first approach, and on-fail=ban with the
> second approach. This would be both simpler and more useful -- it allows
> the value to be set differently per resource.
As said both approaches have their pros and cons and you
can probably invent more that seem to be especially suitable
for certain cases.

A different approach would be to gather a couple of statistics
(e.g. restart failures globally & per operation and ...) and leave
it to the RA to derive a return-value from that info and what
it knows itself.

I know we had a similar discussion already ... and this thread
is probably a result of that ...

We could introduce a new RA-operation for getting the answer or
the new statistics are just another input for start.
A new operation might be appealing as this could - on demand -
be replaced by calling a custom script - no need to edit RAs; one
RA different custom scripts for different resources using the same RA ...

I know that using scripts makes it impossible to derive the
cluster-behavior from just seeing what is configure in the cib,
that the scripts have to be kept in sync over the nodes, ...

Can as well be an addition to one of the suggestions above.
Gathering additional statistics will probably be needed for them
anyway.