[ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

Andrew Beekhof abeekhof at redhat.com
Sun Oct 2 23:02:20 EDT 2016

On Fri, Sep 30, 2016 at 10:28 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
> On 09/28/2016 10:54 PM, Andrew Beekhof wrote:
>> On Sat, Sep 24, 2016 at 9:12 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>>>>     "Ignore" is theoretically possible to escalate, e.g. "ignore 3 failures
>>>>     then migrate", but I can't think of a real-world situation where that
>>>>     makes sense,
>>>> really?
>>>> it is not uncommon to hear "I know it's failed, but I don't want the
>>>> cluster to do anything until it's _really_ failed"
>>> Hmm, I guess that would be similar to how monitoring systems such as
>>> nagios can be configured to send an alert only if N checks in a row
>>> fail. That's useful where transient outages (e.g. a webserver hitting
>>> its request limit) are acceptable for a short time.
>>> I'm not sure that's translatable to Pacemaker. Pacemaker's error count
>>> is not "in a row" but "since the count was last cleared".
>> It would be a major change, but perhaps it should be "in-a-row" and
>> successfully performing the action clears the count.
>> It's entirely possible that the current behaviour is like that because
>> I wasn't smart enough to implement anything else at the time :-)
> Or you were smart enough to realize what a can of worms it is. :)

So you're saying two dumbs make a smart? :-)

>Take a
> look at all of nagios' options for deciding when a failure becomes "real".

I used to take a very hard line on this: if you don't want the cluster
to do anything about an error, don't tell us about it.
However, I'm slowly changing my position... the reality is that many
people do want a heads-up in advance, and we have been forcing that
policy (when does an error become real?) into the agents, where one
size must fit all.

So I'm now generally in favour of having the PE handle this "somehow".

> If you clear failures after a success, you can't detect/recover a
> resource that is flapping.

Ah, but you can, if a success only clears previous failures of the
same action.
A completed start doesn't clear a previously failed monitor.
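That per-action, in-a-row semantics could be sketched like this (hypothetical names, not Pacemaker code): a success resets only the counter for the same action, so a flapping monitor still accumulates failures even while starts succeed.

```python
# Hypothetical sketch of per-action "in a row" failure counting.
# A success clears only the counter for that same action, so a
# completed start does not hide a flapping monitor.

from collections import defaultdict

class FailCounts:
    def __init__(self):
        # action name -> consecutive failures of that action
        self.counts = defaultdict(int)

    def record(self, action, succeeded):
        if succeeded:
            self.counts[action] = 0      # reset only this action's count
        else:
            self.counts[action] += 1
        return self.counts[action]

fc = FailCounts()
fc.record("monitor", False)               # monitor fails once
fc.record("start", True)                  # a successful start...
assert fc.record("monitor", False) == 2   # ...doesn't clear monitor's count
```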

>>> "Ignore up to three monitor failures if they occur in a row [or, within
>>> 10 minutes?], then try soft recovery for the next two monitor failures,
>>> then ban this node for the next monitor failure." Not sure being able to
>>> say that is worth the complexity.
>> Not disagreeing
> It only makes sense to escalate from ignore -> restart -> hard, so maybe
> something like:
>   op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban
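The proposed escalation could be sketched as follows (a hypothetical reading, assuming failures of one action are counted consecutively): the first `ignore-fail` failures are ignored, the next `soft-fail` failures get a restart, and anything beyond that gets the hard recovery.

```python
# Sketch of the proposed escalation semantics (hypothetical, not
# Pacemaker code): ignore, then soft (restart), then the hard action.

def recovery_action(fail_count, ignore_fail=3, soft_fail=2,
                    on_hard_fail="ban"):
    if fail_count <= ignore_fail:
        return "ignore"
    if fail_count <= ignore_fail + soft_fail:
        return "restart"
    return on_hard_fail

# op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban
assert [recovery_action(n) for n in range(1, 7)] == \
    ["ignore", "ignore", "ignore", "restart", "restart", "ban"]
```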

The other idea I had was to create some new return codes:
i.e. make the internal mapping of return codes to recovery
logic available to the agent.

To use your example above, return PCMK_OCF_DEGRADED for the first 3
monitor failures, PCMK_OCF_ERR_RESTART for the next two and
PCMK_OCF_ERR_BAN for the last.
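A sketch of that agent-side variant (hypothetical: of these codes only PCMK_OCF_DEGRADED exists today, and the numeric values for the other two are invented for illustration):

```python
# Sketch of the agent-side variant: the agent itself tracks consecutive
# monitor failures and escalates the return code. PCMK_OCF_DEGRADED is
# a real Pacemaker status; PCMK_OCF_ERR_RESTART / PCMK_OCF_ERR_BAN are
# hypothetical codes from this thread (values invented here).

PCMK_OCF_DEGRADED = 190        # real: service running but degraded
PCMK_OCF_ERR_RESTART = 901     # hypothetical
PCMK_OCF_ERR_BAN = 902         # hypothetical

def monitor_rc(consecutive_failures, ignore=3, soft=2):
    """Map the Nth consecutive monitor failure to a return code."""
    if consecutive_failures <= ignore:
        return PCMK_OCF_DEGRADED
    if consecutive_failures <= ignore + soft:
        return PCMK_OCF_ERR_RESTART
    return PCMK_OCF_ERR_BAN

assert monitor_rc(3) == PCMK_OCF_DEGRADED
assert monitor_rc(5) == PCMK_OCF_ERR_RESTART
assert monitor_rc(6) == PCMK_OCF_ERR_BAN
```

Note that the `ignore` and `soft` thresholds (user policy) and the failure counting both end up inside the agent, which is exactly the objection raised below.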

But the more I think about it, the less I like it.
- We lose precision about what the actual error was
- We're pushing too much user config/policy into the agent (every
agent would end up with equivalents of 'ignore-fail', 'soft-fail', and
'on-hard-fail')
- We might need the agent to know about the fencing config
- It forces the agent to track the number of operation failures

So I think I'm just mentioning it for completeness and in case it
prompts a good idea in someone else.

> To express current default behavior:
>   op start ignore-fail=0 soft-fail=0        on-hard-fail=ban

I would favour something more concrete than 'soft' and 'hard' here.
Do they have a sufficiently obvious meaning outside of us developers?

Perhaps (with or without a "failures-" prefix):


>   op stop  ignore-fail=0 soft-fail=0        on-hard-fail=fence
>   op *     ignore-fail=0 soft-fail=INFINITY on-hard-fail=ban
> on-fail, migration-threshold, and start-failure-is-fatal would be
> deprecated (and would be easy to map to the new parameters).
> I'd avoid the hassles of counting failures "in a row", and stick with
> counting failures since the last cleanup.
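One way to picture that mapping of the deprecated options onto the proposed parameters (my reading of the defaults quoted above, not an official conversion):

```python
# Sketch: mapping on-fail / migration-threshold /
# start-failure-is-fatal onto the proposed parameters,
# per the default table quoted above (my reading, hypothetical).

INFINITY = float("inf")

def map_legacy(op, migration_threshold=INFINITY,
               start_failure_is_fatal=True):
    ignore_fail = 0                   # today no failures are ignored by count
    if op == "start" and start_failure_is_fatal:
        soft_fail = 0                 # first start failure is already hard
    elif op == "stop":
        soft_fail = 0                 # a failed stop escalates immediately
    else:
        # migration-threshold failures trigger the hard action, so up to
        # migration-threshold - 1 failures get a soft (restart) recovery
        soft_fail = migration_threshold - 1
    on_hard_fail = "fence" if op == "stop" else "ban"
    return {"ignore-fail": ignore_fail, "soft-fail": soft_fail,
            "on-hard-fail": on_hard_fail}

assert map_legacy("stop") == {"ignore-fail": 0, "soft-fail": 0,
                              "on-hard-fail": "fence"}
assert map_legacy("monitor")["soft-fail"] == INFINITY
```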

