[ClusterLabs] cluster-recheck-interval and failure-timeout

Antony Stone Antony.Stone at ha.open.source.it
Wed Mar 31 08:32:32 EDT 2021


Hi.

I'm trying to understand what looks to me like incorrect behaviour in the 
interaction between cluster-recheck-interval and failure-timeout, under 
pacemaker 2.0.1.

I have three machines in a corosync (3.0.1 if it matters) cluster, managing 12 
resources in a single group.

I'm following documentation from:

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html

and

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-resource-options.html

I have set a cluster property:

	cluster-recheck-interval=60s

I have set a resource property:

	failure-timeout=180

The docs say failure-timeout is "How many seconds to wait before acting as if 
the failure had not occurred, and potentially allowing the resource back to 
the node on which it failed."

I think this should mean that if the resource fails and gets restarted, the 
fact that it failed will be "forgotten" after 180 seconds (or maybe a little 
longer, depending on exactly when the next cluster recheck is done).
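
To make that expectation concrete, here is a toy model in Python of when I 
would expect a failure to be forgotten. It is only a sketch of the 
expectation, not pacemaker's actual logic: the function name 
expected_expiry is hypothetical, and it assumes rechecks fire at exact 
multiples of cluster-recheck-interval counted from time zero.

```python
import math

def expected_expiry(failure_time, failure_timeout=180, recheck_interval=60):
    """Toy model: time (in seconds) of the first recheck at or after
    failure_time + failure_timeout.

    Assumes rechecks happen at exact multiples of recheck_interval,
    which is an illustrative simplification.
    """
    deadline = failure_time + failure_timeout
    # Round the deadline up to the next recheck tick.
    return math.ceil(deadline / recheck_interval) * recheck_interval

# A failure exactly on a recheck tick is forgotten after 180 seconds.
print(expected_expiry(0))   # 180
# A failure 25 seconds after a tick is forgotten at the 240-second tick,
# i.e. 215 seconds after it happened.
print(expected_expiry(25))  # 240
```

Under this model the delay is never more than failure-timeout plus one 
recheck interval.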

However, what I'm seeing is that if the resource fails and gets restarted, and 
then fails again an hour later, that is still counted as two failures.  If it 
fails and gets restarted another hour after that, it's recorded as three 
failures and (because I have "migration-threshold=3") it gets moved to another 
node (and therefore all the other resources in the group are moved as well).
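
The difference between what I expect and what I see can be sketched as 
follows. This is a toy model in Python, not pacemaker's implementation: the 
function fail_counts is hypothetical, it assumes the count is cleared once 
the most recent failure is older than failure-timeout, and it ignores the 
recheck interval entirely.

```python
def fail_counts(failure_times, failure_timeout=None):
    """Toy model: return the fail count right after each failure.

    failure_times are seconds since some start point, in ascending order.
    failure_timeout=None models failures that never expire.
    """
    counts = []
    count = 0
    last = None
    for t in failure_times:
        # Forget earlier failures if the last one is older than the timeout.
        if failure_timeout is not None and last is not None:
            if t - last > failure_timeout:
                count = 0
        count += 1
        last = t
        counts.append(count)
    return counts

MIGRATION_THRESHOLD = 3
failures = [0, 3600, 7200]  # three failures, one hour apart

expected = fail_counts(failures, failure_timeout=180)
observed = fail_counts(failures)  # behaves as if nothing ever expires

print(expected)  # [1, 1, 1]
print(observed)  # [1, 2, 3]
print(any(c >= MIGRATION_THRESHOLD for c in expected))  # False
print(any(c >= MIGRATION_THRESHOLD for c in observed))  # True
```

With failure-timeout=180 and failures an hour apart, the expected count 
never goes above 1, so the threshold of 3 should never be reached.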

So, what am I misunderstanding about "failure-timeout", and what configuration 
setting do I need to use to tell pacemaker that "provided the resource hasn't 
failed within the past X seconds, forget the fact that it failed more than X 
seconds ago"?


Thanks,


Antony.

-- 
The first fifty percent of an engineering project takes ninety percent of the 
time, and the remaining fifty percent takes another ninety percent of the time.

                                                   Please reply to the list;
                                                         please *don't* CC me.

