[ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

Thu Sep 29 03:54:34 UTC 2016

On Sat, Sep 24, 2016 at 9:12 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
> On 09/22/2016 05:58 PM, Andrew Beekhof wrote:
>>
>>
>> On Fri, Sep 23, 2016 at 1:58 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>>
>>     On 09/22/2016 09:53 AM, Jan Pokorný wrote:
>>     > On 22/09/16 08:42 +0200, Kristoffer Grönlund wrote:
>>     >> Ken Gaillot <kgaillot at redhat.com> writes:
>>     >>
>>     >>> I'm not saying it's a bad idea, just that it's more complicated than it
>>     >>> first sounds, so it's worth thinking through the implications.
>>     >>
>>     >> Thinking about it and looking at how complicated it gets, maybe what
>>     >> you'd really want, to make it clearer for the user, is the ability to
>>     >> explicitly configure the behavior, either globally or per-resource. So
>>     >> instead of having to tweak a set of variables that interact in complex
>>     >> ways, you'd configure something like rule expressions,
>>     >>
>>     >> <on_fail>
>>     >>   <restart repeat="3" />
>>     >>   <migrate timeout="60s" />
>>     >>   <fence/>
>>     >> </on_fail>
>>     >>
>>     >> So, try to restart the service 3 times, if that fails migrate the
>>     >> service, if it still fails, fence the node.
>>     >>
>>     >> (obviously the details and XML syntax are just an example)
>>     >>
>>     >> This would then replace on-fail, migration-threshold, etc.
>>     >
>>     > I must admit that in previous emails in this thread, I wasn't able to
>>     > follow during the first pass, which is not the case with this procedural
>>     > (sequence-ordered) approach.  Though someone can argue it doesn't take
>>     > type of operation into account, which might again open the door for
>>     > non-obvious interactions.
>>
>>     "restart" is the only on-fail value that it makes sense to escalate.
>>
>>     block/stop/fence/standby are final. Block means "don't touch the
>>     resource again", so there can't be any further response to failures.
>>     Stop/fence/standby move the resource off the local node, so failure
>>     handling is reset (there are 0 failures on the new node to begin with).
>>
>>     "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
>>     then migrate", but I can't think of a real-world situation where that
>>     makes sense,
>>
>>
>> really?
>>
>> It is not uncommon to hear "I know it's failed, but I don't want the
>> cluster to do anything until it's _really_ failed".
>
> Hmm, I guess that would be similar to how monitoring systems such as
> nagios can be configured to send an alert only if N checks in a row
> fail. That's useful where transient outages (e.g. a webserver hitting
> its request limit) are acceptable for a short time.
>
> I'm not sure that's translatable to Pacemaker. Pacemaker's error count
> is not "in a row" but "since the count was last cleared".

It would be a major change, but perhaps it should be "in-a-row" and
successfully performing the action clears the count.
It's entirely possible that the current behaviour is like that because
I wasn't smart enough to implement anything else at the time :-)
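
To make the difference between the two counting rules concrete, here is
a minimal Python sketch. It is only an illustration and not Pacemaker
code: the class names and the record() method are made up for this
example, and the sequence of monitor results is invented.

```python
class CumulativeCounter:
    """Counts failures since the count was last cleared (hypothetical model
    of the 'since the count was last cleared' rule described above)."""

    def __init__(self):
        self.count = 0

    def record(self, succeeded):
        # A success does not change the count; only an explicit clear() does.
        if not succeeded:
            self.count += 1
        return self.count

    def clear(self):
        self.count = 0


class InARowCounter:
    """Counts consecutive failures; any success resets the count
    (hypothetical model of the proposed 'in-a-row' rule)."""

    def __init__(self):
        self.count = 0

    def record(self, succeeded):
        self.count = 0 if succeeded else self.count + 1
        return self.count


# Invented sequence of operation results: True = success, False = failure.
results = [False, True, False, False, True, False]

cumulative = CumulativeCounter()
in_a_row = InARowCounter()
cumulative_counts = [cumulative.record(r) for r in results]
in_a_row_counts = [in_a_row.record(r) for r in results]

print(cumulative_counts)  # [1, 1, 2, 3, 3, 4]
print(in_a_row_counts)    # [1, 0, 1, 2, 0, 1]
```

With a threshold of 3, the cumulative rule would trip on the fourth
result, while the in-a-row rule would never trip for this sequence.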

>
> "Ignore up to three monitor failures if they occur in a row [or, within
> 10 minutes?], then try soft recovery for the next two monitor failures,
> then ban this node for the next monitor failure." Not sure being able to
> say that is worth the complexity.

Not disagreeing

>
>>
>>     and it would be a significant re-implementation of "ignore"
>>     (which currently ignores the state of having failed, as opposed to a
>>     particular instance of failure).
>>
>>
>> agreed
>>
>>
>>
>>     What the interface needs to express is: "If this operation fails,
>>     optionally try a soft recovery [always stop+start], but if <N> failures
>>     occur on the same node, proceed to a [configurable] hard recovery".
>>
>>     And of course the interface will need to be different depending on how
>>     certain details are decided, e.g. whether any failures count toward <N>
>>     or just failures of one particular operation type, and whether the hard
>>     recovery type can vary depending on what operation failed.
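
The rule quoted above (soft recovery until <N> failures occur on the
same node, then a configurable hard recovery) could be sketched roughly
as follows. This is a toy model under stated assumptions, not how
Pacemaker implements anything: EscalationPolicy and on_failure() are
hypothetical names, every failure counts toward <N> regardless of
operation type, and the <N>th failure itself triggers the hard recovery.

```python
from collections import defaultdict


class EscalationPolicy:
    """Hypothetical model: restart (soft recovery) until a node has
    accumulated `threshold` failures, then use the hard recovery."""

    def __init__(self, threshold, hard_recovery):
        self.threshold = threshold          # the <N> from the text above
        self.hard_recovery = hard_recovery  # e.g. "ban", "fence", "block"
        self.failures = defaultdict(int)    # failure count per node

    def on_failure(self, node):
        # Assumption: any failed operation counts, whatever its type.
        self.failures[node] += 1
        if self.failures[node] >= self.threshold:
            return self.hard_recovery
        return "restart"  # soft recovery is always stop+start


policy = EscalationPolicy(threshold=3, hard_recovery="ban")
actions = [policy.on_failure("node1") for _ in range(3)]
print(actions)  # ['restart', 'restart', 'ban']

# Counts are per node, so a different node starts from zero failures.
other = policy.on_failure("node2")
print(other)  # restart
```

Counting per operation type, or letting the hard recovery depend on
which operation failed, would mean keying the counts on (node,
operation) instead of node alone, which is exactly the kind of detail
that changes what the interface has to express.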