[ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

Tue Sep 20 16:25:01 EDT 2016

Hi everybody,

Currently, Pacemaker's on-fail property allows you to configure how the
cluster reacts to operation failures. The default "restart" means try to
restart on the same node, optionally moving to another node once
migration-threshold is reached. Other possibilities are "ignore",
"block", "stop", "fence", and "standby".

Occasionally, we get requests for something like migration-threshold
that works with on-fail values other than restart. For example, try
restarting the resource on the same node 3 times, then fence.

I'd like to get your feedback on two alternative approaches we're
considering.

###

Our first proposed approach would add a new hard-fail-threshold
operation property. If specified, the cluster would first try restarting
the resource on the same node, and would do the on-fail handling only
once the number of failures reached the threshold.

For example, you could configure a promote operation with
hard-fail-threshold=3 and on-fail=fence, to fence the node after 3 failures.
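
To make the intended semantics concrete, here is a minimal sketch of the
decision under the first approach. It is a toy model, not Pacemaker
code: the function name and parameters are hypothetical, and the
on-fail=restart case (where migration-threshold still applies) is left
out of scope.

```python
def first_approach_action(fail_count, on_fail, hard_fail_threshold=None):
    """Toy model of the first proposal (hypothetical, not Pacemaker code).

    fail_count is the number of failures so far, including the one that
    just happened. Returns the action taken in response to that failure.
    """
    if on_fail == "restart":
        # Plain restart handling; migration-threshold is not modeled here.
        return "restart"
    if hard_fail_threshold is None:
        # No threshold configured: on-fail applies immediately, as today.
        return on_fail
    if fail_count >= hard_fail_threshold:
        # Threshold reached: fall through to the configured on-fail action.
        return on_fail
    # Below the threshold: soft recovery, restart on the same node.
    return "restart"


# The promote example from the text: hard-fail-threshold=3, on-fail=fence.
promote_actions = [
    first_approach_action(n, on_fail="fence", hard_fail_threshold=3)
    for n in (1, 2, 3)
]
print(promote_actions)  # ['restart', 'restart', 'fence']
```

In this model the first two failures are answered with a restart, and
the third triggers the fence.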

One point that's not settled is whether failures of *any* operation
would count toward the 3 failures (which is how migration-threshold
works now), or only failures of the specified operation.

Currently, if a start fails (but is retried successfully), then a
promote fails (but is retried successfully), then a monitor fails, the
resource will move to another node if migration-threshold=3. We could
keep that behavior with hard-fail-threshold, or only count monitor
failures toward monitor's hard-fail-threshold. Each alternative has
advantages and disadvantages.
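
The difference between the two counting rules can be sketched with the
failure sequence above. This is only an illustration; the helper names
are made up, and real fail counts are tracked per resource and node by
the cluster, not in a list.

```python
from collections import Counter

# Failure sequence from the example: each operation failed once.
failures = ["start", "promote", "monitor"]
threshold = 3


def count_any_operation(failed_ops):
    """Every failure counts toward one shared total (current behavior)."""
    return len(failed_ops)


def count_per_operation(failed_ops):
    """Each operation only accumulates its own failures."""
    return Counter(failed_ops)


# Shared counting: 3 failures in total, so the threshold is reached.
shared_reached = count_any_operation(failures) >= threshold
# Per-operation counting: monitor has failed once, so it is not reached.
monitor_reached = count_per_operation(failures)["monitor"] >= threshold

print(shared_reached, monitor_reached)  # True False
```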

###

The second proposed approach would add a new on-restart-fail resource
property.

As now, any on-fail value other than restart would take effect
immediately after the first failure. A new value, "ban", would
immediately move the resource to another node. (on-fail=ban would behave
like on-fail=restart with migration-threshold=1.)

When on-fail=restart, and restarting on the same node doesn't work, the
cluster would do the on-restart-fail handling. on-restart-fail would
allow the same values as on-fail (minus "restart"), and would default to
"ban".

So, if you want to fence immediately after any promote failure, you
would still configure on-fail=fence; if you want to try restarting a few
times first, you would configure on-fail=restart and on-restart-fail=fence.

This approach keeps the current threshold behavior -- failures of any
operation count toward the threshold. We'd rename migration-threshold to
something like hard-fail-threshold, since it would apply to more than
just migration, but unlike the first approach, it would stay a resource
property.
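
A comparable sketch for the second approach, again as a toy model with
hypothetical names. It assumes that "restarting on the same node doesn't
work" means the resource's failure count has reached the (renamed)
threshold, which is passed in explicitly here.

```python
def second_approach_action(fail_count, threshold,
                           on_fail="restart", on_restart_fail="ban"):
    """Toy model of the second proposal (hypothetical, not Pacemaker code).

    fail_count counts failures of any operation on this resource,
    including the one that just happened.
    """
    if on_fail != "restart":
        # Anything but restart (including the new "ban") applies at once.
        return on_fail
    if fail_count >= threshold:
        # Restarting has not helped: escalate to on-restart-fail.
        return on_restart_fail
    return "restart"


# Fence immediately on any failure: on-fail=fence.
print(second_approach_action(1, threshold=3, on_fail="fence"))  # fence

# Try restarting first: on-fail=restart, on-restart-fail=fence.
print([second_approach_action(n, threshold=3, on_restart_fail="fence")
       for n in (1, 2, 3)])  # ['restart', 'restart', 'fence']
```

With the defaults, a threshold of 1 gives the same result as on-fail=ban,
which matches the equivalence described above.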

###

Comparing the two approaches, the first is more flexible, but also more
complex and potentially confusing.

With either approach, we would deprecate the start-failure-is-fatal
cluster property. start-failure-is-fatal=true would be equivalent to
hard-fail-threshold=1 with the first approach, and on-fail=ban with the
second approach. This would be both simpler and more useful -- it allows
the value to be set differently per resource.
-- 
Ken Gaillot <kgaillot at redhat.com>