[ClusterLabs] failure-timeout not working in corosync 2.0.1
Reid Wahl
nwahl at redhat.com
Wed Mar 31 16:53:53 EDT 2021
Hi, Antony. failure-timeout should be a resource meta attribute, not an
attribute of the monitor operation. At least I'm not aware of it being
configurable per-operation -- maybe it is. Can't check at the moment :)
On Wednesday, March 31, 2021, Antony Stone <Antony.Stone at ha.open.source.it>
wrote:
> Hi.
>
> I've pared my configuration down to almost a bare minimum to demonstrate the problem I'm having.
>
> I have two questions:
>
> 1. What command can I use to find out what pacemaker thinks my cluster.cib file really means?
>
> I know what I put in it, but I want to see what pacemaker has understood from it, to make sure that pacemaker has the same idea about how to manage my resources as I do.
>
>
> 2. Can anyone tell me what the problem is with the following cluster.cib (lines split on spaces to make things more readable, the actual file consists of four lines of text):
>
> primitive IP-float4
>     IPaddr2
>     params
>         ip=10.1.0.5
>         cidr_netmask=24
>     meta
>         migration-threshold=3
>     op
>         monitor
>         interval=10
>         timeout=30
>         on-fail=restart
>         failure-timeout=180
> primitive IPsecVPN
>     lsb:ipsecwrapper
>     meta
>         migration-threshold=3
>     op
>         monitor
>         interval=10
>         timeout=30
>         on-fail=restart
>         failure-timeout=180
> group Everything
>     IP-float4
>     IPsecVPN
>     resource-stickiness=100
> property cib-bootstrap-options:
>     stonith-enabled=no
>     no-quorum-policy=stop
>     start-failure-is-fatal=false
>     cluster-recheck-interval=60s
>
> My problem is that "failure-timeout" is not being honoured. A resource failure simply never times out, and 3 failures (over a fortnight, if that's how long it takes to get 3 failures) mean that the resources move.
>
> I want a failure to be forgotten about after 180 seconds (or at least, soon after that - 240 seconds would be fine, if cluster-recheck-interval means that 180 can't quite be achieved).
>
> Somehow or other, _far_ more than 180 seconds go by, and I *still* have:
>
> fail-count=1 last-failure='Wed Mar 31 21:23:11 2021'
>
> as part of the output of "crm status -f" (the above timestamp is BST, so
> that's 70 minutes ago now).
>
>
> Thanks for any help,
>
>
> Antony.
>
> --
> Don't procrastinate - put it off until tomorrow.
>
> Please reply to the list; please *don't* CC me.
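On question 1, a quick way to compare what you wrote with what Pacemaker actually loaded (a sketch, assuming crmsh and the standard Pacemaker command-line tools are installed on the node):

    # Show the configuration as the crm shell parsed it
    crm configure show

    # Dump the live CIB XML that Pacemaker is working from; this shows
    # whether failure-timeout ended up on the <op> element or in the
    # resource's meta_attributes
    cibadmin --query

    # Check the live configuration for problems
    crm_verify --live-check --verbose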
--
Regards,
Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA