[ClusterLabs] cluster-recheck-interval and failure-timeout
Ken Gaillot
kgaillot at redhat.com
Wed Mar 31 12:01:05 EDT 2021
On Wed, 2021-03-31 at 17:38 +0200, Antony Stone wrote:
> On Wednesday 31 March 2021 at 16:58:30, Antony Stone wrote:
>
> > I'm only interested in the most recent failure. I'm saying that once that
> > failure is more than "failure-timeout" seconds old, I want the fact that
> > the resource failed to be forgotten, so that it can be restarted or moved
> > between nodes as normal, and not either be moved to another node
Ah, then yes, that's how it works.
I thought you wanted older failures to expire as they aged, reducing
the total failure count.
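
As a rough illustration of the behaviour described above, here is a minimal
Python sketch. It is a toy model, not Pacemaker code: the function name and
its parameters are hypothetical, and it ignores when the cluster actually
re-evaluates failures. It assumes only what is stated above (the whole
fail count is forgotten once the most recent failure is older than
failure-timeout) plus the convention that a failure-timeout of 0 means
failures never expire.

```python
def effective_fail_count(fail_count, last_failure, failure_timeout, now):
    """Toy model of failure expiry (hypothetical helper, not a Pacemaker API).

    fail_count      -- failures recorded for the resource on one node
    last_failure    -- time of the most recent failure, in seconds
    failure_timeout -- expiry window in seconds; 0 is treated as "never expire"
    now             -- current time, in seconds
    """
    if failure_timeout <= 0:
        # Expiry disabled: the recorded count stays until cleaned up manually.
        return fail_count
    if now - last_failure >= failure_timeout:
        # The most recent failure has aged out, so the whole count is dropped.
        # Older failures do not expire one by one.
        return 0
    return fail_count


if __name__ == "__main__":
    # Three failures, the latest at t=1000, with a 60-second timeout.
    print(effective_fail_count(3, 1000, 60, 1030))  # still within the window
    print(effective_fail_count(3, 1000, 60, 1100))  # window has passed
```

The count is all-or-nothing in this model: it does not shrink gradually as
individual older failures age.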
> > just because (a) there were two failures last Friday and then one today,
> > or (b) get stuck and not run on any nodes at all because all three nodes
> > had three failures sometime in the past month.
>
> I've just confirmed that this is working as expected on pacemaker 1.1.16
> (Debian 9) and is not working on pacemaker 2.0.1 (Debian 10).
>
> I have one cluster of 3 machines running pacemaker 1.1.16 and I have
> another cluster of 3 machines running pacemaker 2.0.1
>
> They are both running the same set of resources.
>
> I just deliberately killed the same resource on each cluster, and sure
> enough "crm status -f" on both told me it had a fail-count of 1, with a
> last-failure timestamp.
>
> I waited 5 minutes (well above my failure-timeout value) and asked for
> "crm status -f" again.
>
> On pacemaker 1.1.16 there was simply a list of resources; no mention of
> failures. Just what I want.
>
> On pacemaker 2.0.1 there was a list of resources plus a fail-count=1 and a
> last-failure timestamp of 5 minutes earlier.
That sounds like a bug in the Debian port. I'm not aware of any
relevant bugs reported upstream.
> To be sure I'm not being impatient, I've left it an hour (I did this test
> earlier, while I was still trying to understand the timing interactions)
> and the fail-count does not go away.
>
>
> Does anyone have suggestions on how to debug this difference in behaviour
> between pacemaker 1.1.16 and 2.0.1, because at present it prevents me being
> able to upgrade an operational cluster, as the result is simply unusable.
>
>
> Thanks,
>
>
> Antony.
>
--
Ken Gaillot <kgaillot at redhat.com>