[ClusterLabs] cluster-recheck-interval and failure-timeout

Wed Mar 31 12:01:05 EDT 2021

On Wed, 2021-03-31 at 17:38 +0200, Antony Stone wrote:
> On Wednesday 31 March 2021 at 16:58:30, Antony Stone wrote:
> 
> > I'm only interested in the most recent failure.  I'm saying that once
> > that failure is more than "failure-timeout" seconds old, I want the
> > fact that the resource failed to be forgotten, so that it can be
> > restarted or moved between nodes as normal, and not either be moved
> > to another node

Ah, then yes, that's how it works.

I thought you wanted older failures to expire as they aged, reducing
the total failure count.
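
To make the distinction concrete, here is a minimal sketch of the
behaviour described above: the whole fail count is forgotten once the
most recent failure is older than failure-timeout, rather than older
failures dropping off one by one. This is an illustrative model only,
not Pacemaker's code; the function name and arguments are hypothetical,
and it ignores when the cluster actually re-evaluates its state.

```python
from datetime import datetime, timedelta


def effective_fail_count(failure_times, failure_timeout, now):
    """Hypothetical model of failure-timeout, not Pacemaker's implementation.

    All failures are forgotten together once the most recent one is
    older than failure_timeout; otherwise every recorded failure counts.
    """
    if not failure_times:
        return 0
    last_failure = max(failure_times)
    if now - last_failure > failure_timeout:
        return 0  # most recent failure has expired, so the count resets
    return len(failure_times)  # older failures do not age out individually


# Example: two failures several days ago, then one more recently.
now = datetime(2021, 3, 31, 12, 0, 0)
timeout = timedelta(minutes=3)
old = [now - timedelta(days=5), now - timedelta(days=5, minutes=1)]

expired = effective_fail_count(old + [now - timedelta(minutes=10)], timeout, now)
recent = effective_fail_count(old + [now - timedelta(minutes=1)], timeout, now)
print(expired, recent)  # prints: 0 3
```

In this model the two old failures still count while the latest failure
is within the timeout, and all three vanish together once it expires.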

> > just because (a) there were two failures last Friday and then one
> > today, or (b) get stuck and not run on any nodes at all because all
> > three nodes had three failures sometime in the past month.
> 
> I've just confirmed that this is working as expected on pacemaker
> 1.1.16 (Debian 9) and is not working on pacemaker 2.0.1 (Debian 10).
> 
> I have one cluster of 3 machines running pacemaker 1.1.16 and I have
> another cluster of 3 machines running pacemaker 2.0.1
> 
> They are both running the same set of resources.
> 
> I just deliberately killed the same resource on each cluster, and sure
> enough "crm status -f" on both told me it had a fail-count of 1, with
> a last-failure timestamp.
> 
> I waited 5 minutes (well above my failure-timeout value) and asked for
> "crm status -f" again.
> 
> On pacemaker 1.1.16 there was simply a list of resources; no mention
> of failures.  Just what I want.
> 
> On pacemaker 2.0.1 there was a list of resources plus a fail-count=1
> and a last-failure timestamp of 5 minutes earlier.

That sounds like a bug in the Debian port. I'm not aware of any
relevant bugs reported upstream.

> To be sure I'm not being impatient, I've left it an hour (I did this
> test earlier, while I was still trying to understand the timing
> interactions) and the fail-count does not go away.
> 
> 
> Does anyone have suggestions on how to debug this difference in
> behaviour between pacemaker 1.1.16 and 2.0.1, because at present it
> prevents me being able to upgrade an operational cluster, as the
> result is simply unusable.
> 
> 
> Thanks,
> 
> 
> Antony.
> 
-- 
Ken Gaillot <kgaillot at redhat.com>