[ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.
Klaus Wenninger
kwenning at redhat.com
Thu Sep 29 02:46:38 EDT 2016
On 09/29/2016 05:57 AM, Andrew Beekhof wrote:
> On Mon, Sep 26, 2016 at 7:39 PM, Klaus Wenninger <kwenning at redhat.com> wrote:
>> On 09/24/2016 01:12 AM, Ken Gaillot wrote:
>>> On 09/22/2016 05:58 PM, Andrew Beekhof wrote:
>>>> On Fri, Sep 23, 2016 at 1:58 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>>>>
>>>> On 09/22/2016 09:53 AM, Jan Pokorný wrote:
>>>> > On 22/09/16 08:42 +0200, Kristoffer Grönlund wrote:
>>>> >> Ken Gaillot <kgaillot at redhat.com> writes:
>>>> >>
>>>> >>> I'm not saying it's a bad idea, just that it's more complicated than it
>>>> >>> first sounds, so it's worth thinking through the implications.
>>>> >>
>>>> >> Thinking about it and looking at how complicated it gets, maybe what
>>>> >> you'd really want, to make it clearer for the user, is the ability to
>>>> >> explicitly configure the behavior, either globally or per-resource. So
>>>> >> instead of having to tweak a set of variables that interact in complex
>>>> >> ways, you'd configure something like rule expressions,
>>>> >>
>>>> >> <on_fail>
>>>> >>   <restart repeat="3" />
>>>> >>   <migrate timeout="60s" />
>>>> >>   <fence/>
>>>> >> </on_fail>
>>>> >>
>>>> >> So, try to restart the service 3 times, if that fails migrate the
>>>> >> service, if it still fails, fence the node.
>>>> >>
>>>> >> (obviously the details and XML syntax are just an example)
>>>> >>
>>>> >> This would then replace on-fail, migration-threshold, etc.
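>>>> >>
>>>> >> For comparison, the closest the current syntax gets is splitting the
>>>> >> policy across an operation attribute and resource meta-attributes; a
>>>> >> minimal sketch (resource name and ids are hypothetical):
>>>> >>
>>>> >> <primitive id="www" class="ocf" provider="heartbeat" type="apache">
>>>> >>   <meta_attributes id="www-meta">
>>>> >>     <!-- ban this node after 3 failures -->
>>>> >>     <nvpair id="www-mt" name="migration-threshold" value="3"/>
>>>> >>   </meta_attributes>
>>>> >>   <operations>
>>>> >>     <!-- each monitor failure triggers a soft (stop+start) recovery -->
>>>> >>     <op id="www-mon" name="monitor" interval="10s" on-fail="restart"/>
>>>> >>   </operations>
>>>> >> </primitive>
>>>> >>
>>>> >> Note there is no way to express the final "fence" step here; once
>>>> >> migration-threshold is reached, the resource simply moves away.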
>>>> >
>>>> > I must admit that I wasn't able to follow the previous emails in
>>>> > this thread on a first pass, which is not the case with this
>>>> > procedural (sequence-ordered) approach. Though one could argue it
>>>> > doesn't take the type of operation into account, which might again
>>>> > open the door to non-obvious interactions.
>>>>
>>>> "restart" is the only on-fail value that it makes sense to escalate.
>>>>
>>>> block/stop/fence/standby are final. Block means "don't touch the
>>>> resource again", so there can't be any further response to failures.
>>>> Stop/fence/standby move the resource off the local node, so failure
>>>> handling is reset (there are 0 failures on the new node to begin with).
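>>>>
>>>> As a concrete illustration (ids hypothetical), the "final" values are
>>>> all configured per-operation today, e.g.:
>>>>
>>>> <!-- put the node in standby if the monitor fails -->
>>>> <op id="db-mon" name="monitor" interval="30s" on-fail="standby"/>
>>>> <!-- fence the node if the stop itself fails -->
>>>> <op id="db-stop" name="stop" interval="0" on-fail="fence"/>
>>>>
>>>> Once one of these fires, there is nothing left to escalate to on that
>>>> node.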
>>>>
>>>> "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
>>>> then migrate", but I can't think of a real-world situation where that
>>>> makes sense,
>>>>
>>>>
>>>> really?
>>>>
>>>> It is not uncommon to hear "I know it's failed, but I don't want the
>>>> cluster to do anything until it's _really_ failed"
>>> Hmm, I guess that would be similar to how monitoring systems such as
>>> Nagios can be configured to send an alert only if N checks in a row
>>> fail. That's useful where transient outages (e.g. a webserver hitting
>>> its request limit) are acceptable for a short time.
>>>
>>> I'm not sure that's translatable to Pacemaker. Pacemaker's error count
>>> is not "in a row" but "since the count was last cleared".
>>>
>>> "Ignore up to three monitor failures if they occur in a row [or, within
>>> 10 minutes?], then try soft recovery for the next two monitor failures,
>>> then ban this node for the next monitor failure." Not sure being able to
>>> say that is worth the complexity.
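>>>
>>> The closest existing approximation of the time window is the
>>> failure-timeout meta-attribute, which expires old failures rather than
>>> counting them "in a row"; a minimal sketch (ids hypothetical):
>>>
>>> <meta_attributes id="www-meta">
>>>   <!-- failures older than 10 minutes no longer count -->
>>>   <nvpair id="www-ft" name="failure-timeout" value="10min"/>
>>>   <nvpair id="www-mt" name="migration-threshold" value="3"/>
>>> </meta_attributes>
>>>
>>> That still triggers a soft recovery on every failure, though, rather
>>> than ignoring the first few.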
>> That is the reason I suggested thinking about a solution that exposes
>> a certain set of statistics via environment variables and leaves the
>> final logic to be scripted in the RA or in an additional script.
> I don't think you want to go down that path.
> Otherwise you'll end up re-implementing parts of the PE in the agents.
>
> They'll want to know which nodes are available, what their scores are,
> what other services are on them, how many times things have failed
> there, etc. It will be never-ending.
Rather, replacing one possibly never-ending wish-list with another ;-)
>
>>>> and it would be a significant re-implementation of "ignore"
>>>> (which currently ignores the state of having failed, as opposed to a
>>>> particular instance of failure).
>>>>
>>>>
>>>> agreed
>>>>
>>>>
>>>>
>>>> What the interface needs to express is: "If this operation fails,
>>>> optionally try a soft recovery [always stop+start], but if <N> failures
>>>> occur on the same node, proceed to a [configurable] hard recovery".
>>>>
>>>> And of course the interface will need to be different depending on how
>>>> certain details are decided, e.g. whether any failures count toward <N>
>>>> or just failures of one particular operation type, and whether the hard
>>>> recovery type can vary depending on what operation failed.
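>>>>
>>>> Purely to illustrate the shape of such an interface (this syntax does
>>>> not exist, and the attribute names are invented), it might collapse to
>>>> something per-operation like:
>>>>
>>>> <op id="db-mon" name="monitor" interval="10s"
>>>>     on-fail="restart" fail-threshold="3" escalate-to="ban"/>
>>>>
>>>> i.e. up to three soft recoveries on one node, then the configured hard
>>>> recovery.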