[ClusterLabs] failure-timeout not working in corosync 2.0.1
Antony Stone
Antony.Stone at ha.open.source.it
Wed Mar 31 15:39:29 EDT 2021
Hi.
I've pared my configuration down to almost a bare minimum to demonstrate the
problem I'm having.
I have two questions:
1. What command can I use to find out what pacemaker thinks my cluster.cib file
really means?
I know what I put in it, but I want to see what pacemaker has understood from
it, to make sure that pacemaker has the same idea about how to manage my
resources as I do.
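(For what it's worth, the candidates I'm aware of are:

    cibadmin --query        # dump the live CIB as raw XML
    crm configure show      # crmsh's rendering of the configuration
    crm_simulate -sL        # the scores pacemaker computes from the live CIB

but I'm not certain which of these, if any, reflects pacemaker's actual
interpretation rather than just echoing back what I wrote.)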
2. Can anyone tell me what the problem is with the following cluster.cib
(wrapped here with continuation lines for readability; in the actual file
each stanza is a single line, four lines of text in all):
primitive IP-float4 IPaddr2 \
        params ip=10.1.0.5 cidr_netmask=24 \
        meta migration-threshold=3 \
        op monitor interval=10 timeout=30 on-fail=restart failure-timeout=180
primitive IPsecVPN lsb:ipsecwrapper \
        meta migration-threshold=3 \
        op monitor interval=10 timeout=30 on-fail=restart failure-timeout=180
group Everything IP-float4 IPsecVPN resource-stickiness=100
property cib-bootstrap-options: \
        stonith-enabled=no \
        no-quorum-policy=stop \
        start-failure-is-fatal=false \
        cluster-recheck-interval=60s
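(I assume "crm_verify -LV" would flag any outright errors in the live CIB,
but I don't know whether it checks where a given attribute is allowed to
appear, which is partly why I'm asking question 1.)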
My problem is that "failure-timeout" is not being honoured. A resource
failure simply never times out, so three failures (over a fortnight, if
that's how long it takes to accumulate three) mean that the resources move.
I want a failure to be forgotten about after 180 seconds (or at least, soon
after that - 240 seconds would be fine, if cluster-recheck-interval means that
180 can't quite be achieved).
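If I've understood the fail-count machinery correctly, something like

    crm_failcount --query --resource=IP-float4

(or "crm_failcount -G -r IP-float4" - I'm not sure which spelling my
version prefers) should drop back to 0 once the failure-timeout has
expired and the next cluster recheck has run. It never does.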
Somehow or other, _far_ more than 180 seconds go by, and I *still* have:
fail-count=1 last-failure='Wed Mar 31 21:23:11 2021'
as part of the output of "crm status -f" (the above timestamp is BST, so
that's 70 minutes ago now).
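I know I can clear the failure by hand - presumably with

    crm resource cleanup IP-float4

or its lower-level equivalent "crm_resource --cleanup --resource IP-float4",
if I have the options right - but the whole point of failure-timeout is
that I shouldn't have to.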
Thanks for any help,
Antony.
--
Don't procrastinate - put it off until tomorrow.
Please reply to the list;
please *don't* CC me.