[ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.
Ken Gaillot
kgaillot at redhat.com
Wed Oct 5 20:24:08 UTC 2016
On 10/04/2016 05:34 PM, Andrew Beekhof wrote:
>
>
> On Wed, Oct 5, 2016 at 7:03 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>
> On 10/02/2016 10:02 PM, Andrew Beekhof wrote:
> >> Take a
> >> look at all of nagios' options for deciding when a failure becomes "real".
> >
> > I used to take a very hard line on this: if you don't want the cluster
> > to do anything about an error, don't tell us about it.
> > However, I'm slowly changing my position... the reality is that many
> > people do want a heads-up in advance, and we have been forcing that
> > policy (when does an error become real) into the agents, where one size
> > must fit all.
> >
> > So I'm now generally in favour of having the PE handle this "somehow".
>
> Nagios is a useful comparison:
>
> check_interval - like pacemaker's monitor interval
>
> retry_interval - if a check returns failure, switch to this interval
> (i.e. check more frequently when trying to decide whether it's a "hard"
> failure)
>
> max_check_attempts - if a check fails this many times in a row, it's a
> hard failure. Before this is reached, it's considered a soft failure.
> Nagios will call event handlers (comparable to pacemaker's alert agents)
> for both soft and hard failures (distinguishing the two). A service is
> also considered to have a "hard failure" if its host goes down.
>
> high_flap_threshold/low_flap_threshold - a service is considered to be
> flapping when its percent of state changes (ok <-> not ok) in the last
> 21 checks (= max. 20 state changes) reaches high_flap_threshold, and
> stable again once the percentage drops to low_flap_threshold. To put it
> another way, a service that passes every monitor is 0% flapping, and a
> service that fails every other monitor is 100% flapping. With these,
> even if a service never reaches max_check_attempts failures in a row, an
> alert can be sent if it's repeatedly failing and recovering.
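
(A quick aside for anyone not familiar with those directives: the soft/hard
logic boils down to something like the python sketch below. It's a
deliberately simplified model, not real nagios code -- only the directive
names are nagios', the rest is made up for illustration, and details like
what happens after a hard failure are glossed over.)

    import random
    import time

    CHECK_INTERVAL = 300     # check_interval: seconds between checks while OK
    RETRY_INTERVAL = 60      # retry_interval: faster re-checks after a failure
    MAX_CHECK_ATTEMPTS = 3   # failures in a row before the failure is "hard"

    def run_check():
        """Stand-in for whatever actually probes the service."""
        return random.random() < 0.8   # pretend it's OK ~80% of the time

    def monitor():
        failures_in_a_row = 0
        while True:
            if run_check():
                failures_in_a_row = 0
                time.sleep(CHECK_INTERVAL)
            elif failures_in_a_row + 1 < MAX_CHECK_ATTEMPTS:
                failures_in_a_row += 1
                print("soft failure %d/%d: heads-up only"
                      % (failures_in_a_row, MAX_CHECK_ATTEMPTS))
                time.sleep(RETRY_INTERVAL)   # poll faster while deciding
            else:
                failures_in_a_row = 0
                print("hard failure: alert/recover for real")
                time.sleep(CHECK_INTERVAL)
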
>
>
> makes sense.
>
> since we're overhauling this functionality anyway, do you think we need
> to add an equivalent of retry_interval too?
It only makes sense if we switch to "in-a-row" failure counting, in
which case we'd need to add flap detection as well ... probably a bigger
project than desired right now :)
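
To make that concrete: with in-a-row counting, a resource that fails every
other monitor never looks worse than one isolated failure, which is exactly
what flap detection is for. A rough python sketch (ignoring the recency
weighting nagios applies, and with made-up function names):

    # 21 most recent monitor results, oldest first: True = OK, False = failed
    flappy = [True, False] * 10 + [True]    # fails every other check

    def max_failures_in_a_row(history):
        longest = current = 0
        for ok in history:
            current = 0 if ok else current + 1
            longest = max(longest, current)
        return longest

    def flap_percentage(history):
        # percent of possible state changes (20 for 21 checks) that happened
        changes = sum(1 for a, b in zip(history, history[1:]) if a != b)
        return 100.0 * changes / (len(history) - 1)

    print(max_failures_in_a_row(flappy))    # 1     -> never trips "3 in a row"
    print(flap_percentage(flappy))          # 100.0 -> flapping, alert anyway
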
> >> If you clear failures after a success, you can't detect/recover a
> >> resource that is flapping.
> >
> > Ah, but you can if the thing you're clearing only applies to other
> > failures of the same action.
> > A completed start doesn't clear a previously failed monitor.
>
> Nope -- a monitor can alternately succeed and fail repeatedly, and that
> indicates a problem, but wouldn't trip an "N-failures-in-a-row" system.
>
> >> It only makes sense to escalate from ignore -> restart -> hard, so maybe
> >> something like:
> >>
> >> op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban
> >>
> > I would favour something more concrete than 'soft' and 'hard' here.
> > Do they have a sufficiently obvious meaning outside of us developers?
> >
> > Perhaps (with or without a "failures-" prefix) :
> >
> > ignore-count
> > recover-count
> > escalation-policy
>
> I think the "soft" vs "hard" terminology is somewhat familiar to
> sysadmins -- there's at least nagios, email (SPF failures and bounces),
> and ECC RAM. But throwing "ignore" into the mix does confuse things.
>
> How about ... max-fail-ignore=3 max-fail-restart=2 fail-escalation=ban
>
>
> I could live with that :-)
OK, that will be the tentative plan, subject to further discussion of
course. There's a lot on the plate right now, so there's plenty of time
to refine it :)
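
For anyone wanting to picture how those would interact, here's one plausible
reading of the semantics as python pseudocode. Nothing here is implemented
yet, and the names and behaviour may well change as the discussion continues:

    def recovery_action(fail_count, max_fail_ignore=3, max_fail_restart=2,
                        fail_escalation="ban"):
        # Escalate ignore -> restart -> <fail-escalation> as failures
        # accumulate: the first max-fail-ignore failures are ignored, the
        # next max-fail-restart trigger a restart, and anything beyond that
        # applies the escalation policy.
        if fail_count <= max_fail_ignore:
            return "ignore"
        if fail_count <= max_fail_ignore + max_fail_restart:
            return "restart"
        return fail_escalation

    for n in range(1, 8):
        print(n, recovery_action(n))   # 1-3 ignore, 4-5 restart, 6+ ban
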