[ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

Thu Sep 22 11:58:07 EDT 2016

On 09/22/2016 09:53 AM, Jan Pokorný wrote:
> On 22/09/16 08:42 +0200, Kristoffer Grönlund wrote:
>> Ken Gaillot <kgaillot at redhat.com> writes:
>>
>>> I'm not saying it's a bad idea, just that it's more complicated than it
>>> first sounds, so it's worth thinking through the implications.
>>
>> Thinking about it and looking at how complicated it gets, maybe what
>> you'd really want, to make it clearer for the user, is the ability to
>> explicitly configure the behavior, either globally or per-resource. So
>> instead of having to tweak a set of variables that interact in complex
>> ways, you'd configure something like rule expressions,
>>
>> <on_fail>
>>   <restart repeat="3" />
>>   <migrate timeout="60s" />
>>   <fence/>
>> </on_fail>
>>
>> So, try to restart the service 3 times, if that fails migrate the
>> service, if it still fails, fence the node.
>>
>> (obviously the details and XML syntax are just an example)
>>
>> This would then replace on-fail, migration-threshold, etc.
> 
> I must admit that in previous emails in this thread, I wasn't able to
> follow during the first pass, which is not the case with this procedural
> (sequence-ordered) approach.  Though someone can argue it doesn't take
> type of operation into account, which might again open the door for
> non-obvious interactions.

"restart" is the only on-fail value that it makes sense to escalate.

block/stop/fence/standby are final. Block means "don't touch the
resource again", so there can't be any further response to failures.
Stop/fence/standby move the resource off the local node, so failure
handling is reset (there are 0 failures on the new node to begin with).

"Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
then migrate", but I can't think of a real-world situation where that
makes sense, and it would be a significant re-implementation of "ignore"
(which currently ignores the state of having failed, as opposed to a
particular instance of failure).

What the interface needs to express is: "If this operation fails,
optionally try a soft recovery [always stop+start], but if <N> failures
occur on the same node, proceed to a [configurable] hard recovery".
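
As a rough illustration of that rule (a sketch only, not how Pacemaker
implements anything; EscalationPolicy, max_failures and on_failure are
names made up here, and it assumes the <N>th failure itself triggers
the hard recovery):

```python
from collections import defaultdict


class EscalationPolicy:
    """Hypothetical model: soft recovery until N failures on one node."""

    def __init__(self, max_failures, hard_recovery):
        if max_failures < 1:
            raise ValueError("max_failures must be at least 1")
        self.max_failures = max_failures    # the <N> from the text
        self.hard_recovery = hard_recovery  # e.g. "stop", "fence", "standby"
        self.failures = defaultdict(int)    # failure count per node

    def on_failure(self, node):
        """Record a failure on `node` and return the recovery to attempt."""
        self.failures[node] += 1
        if self.failures[node] >= self.max_failures:
            return self.hard_recovery
        return "restart"  # soft recovery: stop + start on the same node


policy = EscalationPolicy(max_failures=3, hard_recovery="fence")
actions = [policy.on_failure("node1") for _ in range(3)]
print(actions)                     # ['restart', 'restart', 'fence']
print(policy.on_failure("node2"))  # 'restart' (a new node starts at 0)
```

Setting max_failures to 1 in this model corresponds to skipping the
soft recovery entirely.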

And of course the interface will need to be different depending on how
certain details are decided, e.g. whether failures of any operation
count toward <N> or only failures of one particular operation type, and
whether the hard recovery type can vary depending on which operation
failed.
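
To make those two open questions concrete, here is a small sketch
(hypothetical helper names make_failure_counter and hard_recovery_for;
neither corresponds to an existing option or API) showing how the
counting key and the hard recovery lookup would differ:

```python
from collections import Counter


def make_failure_counter(per_operation):
    """Return a function that records a failure and returns the new count.

    per_operation=False: every failure on a node counts toward <N>.
    per_operation=True:  each operation type has its own count per node.
    """
    counts = Counter()

    def record(node, operation):
        key = (node, operation) if per_operation else (node, None)
        counts[key] += 1
        return counts[key]

    return record


def hard_recovery_for(operation, overrides, default):
    """Pick the hard recovery, optionally varying by failed operation."""
    return overrides.get(operation, default)


any_op = make_failure_counter(per_operation=False)
any_op("node1", "monitor")
print(any_op("node1", "start"))    # 2: both failures share one count

per_op = make_failure_counter(per_operation=True)
per_op("node1", "monitor")
print(per_op("node1", "start"))    # 1: "start" is counted separately

print(hard_recovery_for("stop", {"stop": "fence"}, "standby"))     # fence
print(hard_recovery_for("monitor", {"stop": "fence"}, "standby"))  # standby
```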