[ClusterLabs] Antw: Re: Antw: [EXT] Re: cluster-recheck-interval and failure-timeout

Wed Apr 7 04:59:55 EDT 2021

On Wednesday 07 April 2021 at 10:40:54, Ulrich Windl wrote:

> >>> Ken Gaillot <kgaillot at redhat.com> wrote on 06.04.2021 at 15:58
> > On Tue, 2021-04-06 at 09:15 +0200, Ulrich Windl wrote:

> >> Sorry I don't get it: If you have a timestamp for each failure-
> >> timeout, what's so hard to put all the fail counts that are older than
> >> failure-timeout on a list, and then reset that list to zero?
> > 
> > That's exactly the issue -- we don't have a timestamp for each failure.
> > Only the most recent failed operation, and the total fail count (per
> > resource and operation), are stored in the CIB status.
> > 
> > We could store all failures in the CIB, but that would be a significant
> > project, and we'd need new options to keep the current behavior as the
> > default.
> 
> I still don't quite get it: Some failing operation increases the
> fail-count, and the time stamp for the failing operation is recorded
> (crm_mon can display it). So solving this problem (saving the last time
> for each fail count) doesn't look so hard to do.

For the avoidance of doubt, I (who started this thread) have solved my problem 
by following the advice from Reid Wahl - I was putting the "failure-timeout" 
parameter into the wrong section of my resource definition.  Moving it to 
the "meta" section has resolved my problem.

The way it works now makes complete sense to me:

1. A failure happens, and gets corrected.

2. Provided no further failure of that resource occurs within the failure-
timeout setting, the failure gets forgotten about.

3. If a further failure of the resource does occur within failure-timeout, the 
original timestamp is discarded, the failure count is incremented, and the 
timestamp of the new failure is used to check whether there's another failure 
within failure-timeout of *that*.

4. If no further failure occurs within failure-timeout of the most recent 
failure timestamp, all previous failures are forgotten.

5. If enough failures occur within failure-timeout *of each other* then the 
failure count gets incremented to the point where the resource gets moved to 
another node.
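
To make the five steps above concrete, here is a minimal Python sketch of how 
I understand them. It is a toy model, not Pacemaker code: the class and method 
names are made up for illustration, I am assuming the "moved to another node" 
limit is a simple count (called migration_threshold here), and it ignores the 
question of when the cluster actually gets round to checking for expiry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureTracker:
    """Toy model of one resource's fail count (hypothetical, not Pacemaker code)."""
    failure_timeout: float      # seconds, standing in for the failure-timeout setting
    migration_threshold: int    # assumed: failures allowed before the resource moves
    fail_count: int = 0
    last_failure: Optional[float] = None  # only the most recent timestamp is kept

    def expire_if_due(self, now: float) -> None:
        # Steps 2 and 4: once failure_timeout has passed since the most
        # recent failure, all earlier failures are forgotten with it.
        if (self.last_failure is not None
                and now - self.last_failure >= self.failure_timeout):
            self.fail_count = 0
            self.last_failure = None

    def record_failure(self, now: float) -> bool:
        """Record a failure at time `now`; return True if the resource should move."""
        self.expire_if_due(now)
        # Step 3: the old timestamp is discarded and the count is incremented.
        self.fail_count += 1
        self.last_failure = now
        # Step 5: enough failures close to each other reach the threshold.
        return self.fail_count >= self.migration_threshold


tracker = FailureTracker(failure_timeout=60, migration_threshold=3)
print(tracker.record_failure(0))    # first failure
print(tracker.record_failure(50))   # second failure, 50s after the first
tracker.expire_if_due(120)          # 70s after the most recent failure
print(tracker.fail_count)
```

Because only one timestamp is kept, a third failure at 100s would count 
against the threshold even though the first failure is by then more than 60s 
old, which matches the "of each other" wording in step 5.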

Regards,

Antony.

-- 
"It wouldn't be a good idea to talk about him behind his back in front of 
him."

 - murble

                                                   Please reply to the list;
                                                         please *don't* CC me.