[ClusterLabs] Antw: Re: Antw: [EXT] Re: cluster-recheck-interval and failure-timeout
Antony Stone
Antony.Stone at ha.open.source.it
Wed Apr 7 04:59:55 EDT 2021
On Wednesday 07 April 2021 at 10:40:54, Ulrich Windl wrote:
> >>> Ken Gaillot <kgaillot at redhat.com> wrote on 06.04.2021 at 15:58
> > On Tue, 2021-04-06 at 09:15 +0200, Ulrich Windl wrote:
> >> Sorry I don't get it: If you have a timestamp for each
> >> failure-timeout, what's so hard to put all the fail counts that are
> >> older than failure-timeout on a list, and then reset that list to zero?
> >
> > That's exactly the issue -- we don't have a timestamp for each failure.
> > Only the most recent failed operation, and the total fail count (per
> > resource and operation), are stored in the CIB status.
> >
> > We could store all failures in the CIB, but that would be a significant
> > project, and we'd need new options to keep the current behavior as the
> > default.
>
> I still don't quite get it: Some failing operation increases the
> fail-count, and the time stamp for the failing operation is recorded
> (crm_mon can display it). So solving this problem (saving the last time
> for each fail count) doesn't look so hard to do.
For the avoidance of doubt, I (who started this thread) have solved my problem
by following Reid Wahl's advice: I had put the "failure-timeout" parameter
into the wrong section of my resource definition. Moving it to the "meta"
section resolved the issue.
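For reference, the distinction in the CIB is between instance attributes
(parameters passed to the resource agent) and meta-attributes (options
interpreted by the cluster itself); failure-timeout only takes effect as a
meta-attribute. A sketch of the correct placement, using a hypothetical
resource id and a Dummy agent:

```xml
<!-- Hypothetical primitive "my_resource": failure-timeout must sit in
     meta_attributes, not instance_attributes, to be honoured. -->
<primitive id="my_resource" class="ocf" provider="heartbeat" type="Dummy">
  <meta_attributes id="my_resource-meta_attributes">
    <nvpair id="my_resource-failure-timeout"
            name="failure-timeout" value="60s"/>
  </meta_attributes>
</primitive>
```

With crmsh this corresponds to setting the option after the `meta` keyword in
`crm configure`; with pcs, to `pcs resource update my_resource meta
failure-timeout=60s`.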
The way it works now makes complete sense to me:
1. A failure happens, and gets corrected.
2. Provided no further failure of that resource occurs within the
failure-timeout setting, the failure is forgotten.
3. If a further failure of the resource does occur within failure-timeout,
the original timestamp is discarded, the failure count is incremented, and
the timestamp of the new failure is used to check whether another failure
occurs within failure-timeout of *that* one.
4. If no further failure occurs within failure-timeout of the most recent
failure timestamp, all previous failures are forgotten.
5. If enough failures occur within failure-timeout *of each other*, the
failure count eventually reaches migration-threshold and the resource gets
moved to another node.
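The sequence above can be sketched as a toy simulation. This is illustrative
only, not Pacemaker's actual implementation; the class and method names are
invented, and times are plain numbers standing in for wall-clock seconds:

```python
class FailureTracker:
    """Toy model of per-resource failure counting with failure-timeout."""

    def __init__(self, failure_timeout, migration_threshold):
        self.failure_timeout = failure_timeout
        self.migration_threshold = migration_threshold
        self.fail_count = 0
        self.last_failure = None  # only the most recent timestamp is kept

    def record_failure(self, now):
        # A new failure within failure-timeout of the previous one keeps the
        # accumulated count and replaces the timestamp (step 3); otherwise
        # the old history has expired and the count restarts at 1 (step 4).
        if (self.last_failure is not None
                and now - self.last_failure < self.failure_timeout):
            self.fail_count += 1
        else:
            self.fail_count = 1
        self.last_failure = now

    def check_expiry(self, now):
        # Run on each cluster recheck: forget all failures once the most
        # recent one is older than failure-timeout (step 2).
        if (self.last_failure is not None
                and now - self.last_failure >= self.failure_timeout):
            self.fail_count = 0
            self.last_failure = None

    def should_move(self):
        # Step 5: enough failures close together force the resource away.
        return self.fail_count >= self.migration_threshold
```

Note that only the most recent timestamp is stored, which mirrors the point
Ken made earlier in the thread: expiry is judged from the last failure, not
from each individual one.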
Regards,
Antony.
--
"It wouldn't be a good idea to talk about him behind his back in front of
him."
- murble
Please reply to the list;
please *don't* CC me.