[ClusterLabs] cluster-recheck-interval and failure-timeout

Antony Stone Antony.Stone at ha.open.source.it
Wed Mar 31 11:38:09 EDT 2021


On Wednesday 31 March 2021 at 16:58:30, Antony Stone wrote:

> I'm only interested in the most recent failure.  I'm saying that once that
> failure is more than "failure-timeout" seconds old, I want the fact that
> the resource failed to be forgotten, so that it can be restarted or moved
> between nodes as normal, and not either be moved to another node just
> because (a) there were two failures last Friday and then one today, or (b)
> get stuck and not run on any nodes at all because all three nodes had
> three failures sometime in the past month.

I've just confirmed that this is working as expected on pacemaker 1.1.16 
(Debian 9) and is not working on pacemaker 2.0.1 (Debian 10).

I have one cluster of 3 machines running pacemaker 1.1.16 and I have another 
cluster of 3 machines running pacemaker 2.0.1.

They are both running the same set of resources.

I just deliberately killed the same resource on each cluster, and sure enough 
"crm status -f" on both told me it had a fail-count of 1, with a last-failure 
timestamp.

I waited 5 minutes (well above my failure-timeout value) and asked for "crm 
status -f" again.

On pacemaker 1.1.16 there was simply a list of resources; no mention of 
failures.  Just what I want.

On pacemaker 2.0.1 there was a list of resources plus a fail-count=1 and a 
last-failure timestamp of 5 minutes earlier.

To be sure I'm not being impatient, I've left it an hour (I did this test 
earlier, while I was still trying to understand the timing interactions) and 
the fail-count does not go away.
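
To spell out the behaviour I am expecting, here is a minimal Python sketch. 
It is only an illustration of the rule described above, not Pacemaker's 
actual logic; the function name, the timestamps and the 180-second 
failure-timeout are all made-up example values.

```python
from datetime import datetime, timedelta


def failure_expired(last_failure, failure_timeout_s, now):
    """Return True once the most recent failure is older than failure-timeout.

    Hypothetical helper: only the most recent failure matters, and earlier
    failures play no part in the decision.
    """
    return now - last_failure > timedelta(seconds=failure_timeout_s)


# Example values only (assumptions, not taken from the real clusters).
last_failure = datetime(2021, 1, 1, 12, 0, 0)
failure_timeout_s = 180

# One minute after the failure: still within failure-timeout.
print(failure_expired(last_failure, failure_timeout_s,
                      last_failure + timedelta(minutes=1)))

# Five minutes after the failure: I expect the fail-count to be forgotten.
print(failure_expired(last_failure, failure_timeout_s,
                      last_failure + timedelta(minutes=5)))
```

By this rule the second case should report the failure as expired, which 
matches what I see on 1.1.16 but not on 2.0.1.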


Does anyone have suggestions on how to debug this difference in behaviour 
between pacemaker 1.1.16 and 2.0.1?  At present it prevents me from upgrading 
an operational cluster, as the result is simply unusable.


Thanks,


Antony.

-- 
Perfection in design is achieved not when there is nothing left to add, but 
rather when there is nothing left to take away.

 - Antoine de Saint-Exupery

                                                   Please reply to the list;
                                                         please *don't* CC me.

