[ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.
Ken Gaillot
kgaillot at redhat.com
Thu Sep 29 20:28:15 EDT 2016
On 09/28/2016 10:54 PM, Andrew Beekhof wrote:
> On Sat, Sep 24, 2016 at 9:12 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>>> "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
>>> then migrate", but I can't think of a real-world situation where that
>>> makes sense,
>>>
>>>
>>> really?
>>>
>>> it is not uncommon to hear "I know it's failed, but I don't want the
>>> cluster to do anything until it's _really_ failed"
>>
>> Hmm, I guess that would be similar to how monitoring systems such as
>> nagios can be configured to send an alert only if N checks in a row
>> fail. That's useful where transient outages (e.g. a webserver hitting
>> its request limit) are acceptable for a short time.
>>
>> I'm not sure that's translatable to Pacemaker. Pacemaker's error count
>> is not "in a row" but "since the count was last cleared".
>
> It would be a major change, but perhaps it should be "in-a-row" and
> successfully performing the action clears the count.
> It's entirely possible that the current behaviour is like that because
> I wasn't smart enough to implement anything else at the time :-)

Or you were smart enough to realize what a can of worms it is. :) Take a
look at all of nagios' options for deciding when a failure becomes "real".

If you clear failures after a success, you can't detect or recover a
resource that is flapping: each intervening success resets the count
before it can reach any threshold.
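
To make the flapping problem concrete, here is a minimal Python sketch
comparing the two counting policies on a resource whose monitor fails
every other time. The function names and the threshold of 3 are made up
for illustration; this is not Pacemaker code.

```python
def cumulative_failures(results):
    """Count failures since the last cleanup; a success never resets it."""
    return sum(1 for ok in results if not ok)


def longest_failure_streak(results):
    """Longest run of consecutive failures; any success resets the run."""
    longest = current = 0
    for ok in results:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest


# Hypothetical monitor history: True = success, False = failure.
# The resource flaps, failing every other check.
flapping = [False, True] * 5
THRESHOLD = 3  # illustrative value only

print(cumulative_failures(flapping) >= THRESHOLD)     # True
print(longest_failure_streak(flapping) >= THRESHOLD)  # False
```

With "since last cleanup" counting, the five failures cross the
threshold; with "in a row" counting, the streak never gets past 1, so the
flapping resource would never be recovered.
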
>> "Ignore up to three monitor failures if they occur in a row [or, within
>> 10 minutes?], then try soft recovery for the next two monitor failures,
>> then ban this node for the next monitor failure." Not sure being able to
>> say that is worth the complexity.
>
> Not disagreeing

It only makes sense to escalate from ignore -> restart -> hard, so maybe
something like:

  op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban

To express current default behavior:

  op start ignore-fail=0 soft-fail=0 on-hard-fail=ban
  op stop ignore-fail=0 soft-fail=0 on-hard-fail=fence
  op * ignore-fail=0 soft-fail=INFINITY on-hard-fail=ban

on-fail, migration-threshold, and start-failure-is-fatal would be
deprecated (and would be easy to map to the new parameters).

I'd avoid the hassles of counting failures "in a row", and stick with
counting failures since the last cleanup.
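
As a rough sketch of how the proposed parameters could be read, the
Python below picks a response for the Nth failure since the last cleanup.
The function name and the exact boundary semantics (ignore the first
ignore-fail failures, soft-recover the next soft-fail, then apply
on-hard-fail) are assumptions based on the examples above, not an
existing Pacemaker interface.

```python
INFINITY = float("inf")


def action_for_failure(fail_count, ignore_fail, soft_fail, on_hard_fail):
    """Return the response to the fail_count-th failure since last cleanup.

    Assumed semantics: failures 1..ignore_fail are ignored, the next
    soft_fail failures get a soft recovery (restart), and anything
    beyond that triggers the on_hard_fail action.
    """
    if fail_count <= ignore_fail:
        return "ignore"
    if fail_count <= ignore_fail + soft_fail:
        return "restart"
    return on_hard_fail


# op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban
monitor = [action_for_failure(n, 3, 2, "ban") for n in range(1, 7)]
print(monitor)
# ['ignore', 'ignore', 'ignore', 'restart', 'restart', 'ban']

# op start ignore-fail=0 soft-fail=0 on-hard-fail=ban
print(action_for_failure(1, 0, 0, "ban"))             # ban

# op * ignore-fail=0 soft-fail=INFINITY on-hard-fail=ban
print(action_for_failure(1000, 0, INFINITY, "ban"))   # restart
```

Because the count is only ever compared against fixed thresholds, this
reading works with plain "since last cleanup" counting and needs no
notion of failures being "in a row".
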