[ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

Fri Sep 30 02:28:15 CEST 2016

On 09/28/2016 10:54 PM, Andrew Beekhof wrote:
> On Sat, Sep 24, 2016 at 9:12 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>>>     "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
>>>     then migrate", but I can't think of a real-world situation where that
>>>     makes sense,
>>>
>>>
>>> really?
>>>
>>> it is not uncommon to hear "I know it's failed, but I don't want the
>>> cluster to do anything until it's _really_ failed"
>>
>> Hmm, I guess that would be similar to how monitoring systems such as
>> nagios can be configured to send an alert only if N checks in a row
>> fail. That's useful where transient outages (e.g. a webserver hitting
>> its request limit) are acceptable for a short time.
>>
>> I'm not sure that's translatable to Pacemaker. Pacemaker's error count
>> is not "in a row" but "since the count was last cleared".
> 
> It would be a major change, but perhaps it should be "in-a-row" and
> successfully performing the action clears the count.
> It's entirely possible that the current behaviour is like that because
> I wasn't smart enough to implement anything else at the time :-)

Or you were smart enough to realize what a can of worms it is. :) Take a
look at all of nagios' options for deciding when a failure becomes "real".

If you clear failures after a success, you can't detect or recover a
resource that is flapping: each success resets the count before it can
reach any threshold.
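
To illustrate the flapping case, here is a minimal Python sketch. It is
not Pacemaker code, and the function names and the threshold of 3 are
made up for the example. It compares a count kept since the last cleanup
with an "in a row" count that resets on every success:

```python
def failures_since_cleanup(results):
    """Count every failure since the count was last cleared."""
    return sum(1 for ok in results if not ok)


def max_failures_in_a_row(results):
    """Longest run of consecutive failures; any success resets the run."""
    longest = current = 0
    for ok in results:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest


# A flapping resource: every other monitor fails (False = failed monitor).
flapping = [False, True] * 5
threshold = 3  # hypothetical escalation threshold

# The cumulative count reaches the threshold ...
print(failures_since_cleanup(flapping) >= threshold)
# ... but the consecutive count never gets past one failure.
print(max_failures_in_a_row(flapping) >= threshold)
```

With "in a row" counting, this resource could keep failing every other
check indefinitely without ever escalating.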

>> "Ignore up to three monitor failures if they occur in a row [or, within
>> 10 minutes?], then try soft recovery for the next two monitor failures,
>> then ban this node for the next monitor failure." Not sure being able to
>> say that is worth the complexity.
> 
> Not disagreeing

It only makes sense to escalate from ignore -> restart -> hard, so maybe
something like:

  op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban

To express current default behavior:

  op start ignore-fail=0 soft-fail=0        on-hard-fail=ban
  op stop  ignore-fail=0 soft-fail=0        on-hard-fail=fence
  op *     ignore-fail=0 soft-fail=INFINITY on-hard-fail=ban

on-fail, migration-threshold, and start-failure-is-fatal would be
deprecated (and would be easy to map to the new parameters).
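
As a rough sketch of how I read the proposed semantics (the function and
the "restart" label are hypothetical, and none of this exists in
Pacemaker), the response to the Nth failure since the last cleanup could
be chosen like this in Python:

```python
import math

INFINITY = math.inf  # stand-in for Pacemaker's INFINITY in this sketch


def response(failure_number, ignore_fail, soft_fail, on_hard_fail):
    """Pick the response to the Nth failure since the last cleanup.

    Assumed reading of the proposal: the first `ignore_fail` failures are
    ignored, the next `soft_fail` get soft recovery (a restart), and any
    failure after that triggers `on_hard_fail`.
    """
    if failure_number <= ignore_fail:
        return "ignore"
    if failure_number <= ignore_fail + soft_fail:
        return "restart"
    return on_hard_fail


# op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban
for n in range(1, 7):
    print(n, response(n, ignore_fail=3, soft_fail=2, on_hard_fail="ban"))

# op * ignore-fail=0 soft-fail=INFINITY on-hard-fail=ban: always soft recovery
print(response(1000, ignore_fail=0, soft_fail=INFINITY, on_hard_fail="ban"))
```

Under that reading, the monitor example ignores failures 1 to 3, restarts
on failures 4 and 5, and bans the node on failure 6.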

I'd avoid the hassles of counting failures "in a row", and stick with
counting failures since the last cleanup.