[ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

Andrew Beekhof abeekhof at redhat.com
Thu Sep 29 03:57:14 UTC 2016


On Mon, Sep 26, 2016 at 7:39 PM, Klaus Wenninger <kwenning at redhat.com> wrote:
> On 09/24/2016 01:12 AM, Ken Gaillot wrote:
>> On 09/22/2016 05:58 PM, Andrew Beekhof wrote:
>>>
>>> On Fri, Sep 23, 2016 at 1:58 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>>>
>>>     On 09/22/2016 09:53 AM, Jan Pokorný wrote:
>>>     > On 22/09/16 08:42 +0200, Kristoffer Grönlund wrote:
>>>     >> Ken Gaillot <kgaillot at redhat.com> writes:
>>>     >>
>>>     >>> I'm not saying it's a bad idea, just that it's more complicated than it
>>>     >>> first sounds, so it's worth thinking through the implications.
>>>     >>
>>>     >> Thinking about it and looking at how complicated it gets, maybe what
>>>     >> you'd really want, to make it clearer for the user, is the ability to
>>>     >> explicitly configure the behavior, either globally or per-resource. So
>>>     >> instead of having to tweak a set of variables that interact in complex
>>>     >> ways, you'd configure something like rule expressions,
>>>     >>
>>>     >> <on_fail>
>>>     >>   <restart repeat="3" />
>>>     >>   <migrate timeout="60s" />
>>>     >>   <fence/>
>>>     >> </on_fail>
>>>     >>
>>>     >> So: try to restart the service 3 times; if that fails, migrate the
>>>     >> service; if it still fails, fence the node.
>>>     >>
>>>     >> (obviously the details and XML syntax are just an example)
>>>     >>
>>>     >> This would then replace on-fail, migration-threshold, etc.
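
For comparison, the behaviour in that example is roughly what you can express
today by combining migration-threshold with on-fail (minus the final
escalation to fencing); an untested sketch, attribute names from memory:

    <primitive id="web" class="ocf" provider="heartbeat" type="apache">
      <meta_attributes id="web-meta">
        <!-- after 3 failures on a node, move the resource off that node -->
        <nvpair id="web-migration-threshold" name="migration-threshold" value="3"/>
      </meta_attributes>
      <operations>
        <!-- a failed monitor first triggers a stop+start on the same node -->
        <op id="web-monitor-10s" name="monitor" interval="10s" on-fail="restart"/>
      </operations>
    </primitive>

The proposal would fold those two knobs, plus the fencing step, into one
ordered list.
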
>>>     >
>>>     > I must admit I couldn't follow the previous emails in this thread
>>>     > on the first pass, which is not the case with this procedural
>>>     > (sequence-ordered) approach.  Though one could argue it doesn't take
>>>     > the type of operation into account, which might again open the door
>>>     > for non-obvious interactions.
>>>
>>>     "restart" is the only on-fail value that it makes sense to escalate.
>>>
>>>     block/stop/fence/standby are final. Block means "don't touch the
>>>     resource again", so there can't be any further response to failures.
>>>     Stop/fence/standby move the resource off the local node, so failure
>>>     handling is reset (there are 0 failures on the new node to begin with).
>>>
>>>     "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
>>>     then migrate", but I can't think of a real-world situation where that
>>>     makes sense,
>>>
>>>
>>> really?
>>>
>>> it is not uncommon to hear "I know it's failed, but I don't want the
>>> cluster to do anything until it's _really_ failed"
>> Hmm, I guess that would be similar to how monitoring systems such as
>> Nagios can be configured to send an alert only if N checks in a row
>> fail. That's useful where transient outages (e.g. a webserver hitting
>> its request limit) are acceptable for a short time.
>>
>> I'm not sure that's translatable to Pacemaker. Pacemaker's error count
>> is not "in a row" but "since the count was last cleared".
>>
>> "Ignore up to three monitor failures if they occur in a row [or, within
>> 10 minutes?], then try soft recovery for the next two monitor failures,
>> then ban this node for the next monitor failure." Not sure being able to
>> say that is worth the complexity.
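
The closest thing to a time window today is failure-timeout, which expires
old failures rather than counting them "in a row".  Roughly (untested, and
expiry is only checked at the cluster-recheck-interval):

    <meta_attributes id="web-meta">
      <!-- failures older than 10 minutes stop counting toward the threshold -->
      <nvpair id="web-failure-timeout" name="failure-timeout" value="10min"/>
      <nvpair id="web-migration-threshold" name="migration-threshold" value="3"/>
    </meta_attributes>

That still isn't "N failures in a row", but it does bound how long old
failures keep influencing the decision.
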
> That is the reason why I suggested thinking about a solution that
> exposes a set of statistics via environment variables and leaves
> the final logic to be scripted in the RA or an additional script.

I don't think you want to go down that path.
Otherwise you'll end up re-implementing parts of the PE in the agents.

They'll want to know which nodes are available, what their scores are,
what other services are on them, how many times things have failed
there, and so on. It will be never-ending.


>>>     and it would be a significant re-implementation of "ignore"
>>>     (which currently ignores the state of having failed, as opposed to a
>>>     particular instance of failure).
>>>
>>>
>>> agreed
>>>
>>>
>>>
>>>     What the interface needs to express is: "If this operation fails,
>>>     optionally try a soft recovery [always stop+start], but if <N> failures
>>>     occur on the same node, proceed to a [configurable] hard recovery".
>>>
>>>     And of course the interface will need to be different depending on how
>>>     certain details are decided, e.g. whether any failures count toward <N>
>>>     or just failures of one particular operation type, and whether the hard
>>>     recovery type can vary depending on what operation failed.
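
To make that concrete, a purely hypothetical sketch (neither
"restart-threshold" nor "escalation" exists today; the names are invented
for illustration only):

    <!-- hypothetical attributes: retry in place up to 3 times on a node,
         then fall back to the configured hard recovery -->
    <op id="db-monitor-30s" name="monitor" interval="30s"
        on-fail="restart" restart-threshold="3" escalation="ban"/>

Whether the threshold counts only this operation's failures or any failure
on the node, and whether the escalation can differ per operation, are
exactly the details still to be decided.
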