[ClusterLabs] failure-timeout not working in corosync 2.0.1

Reid Wahl nwahl at redhat.com
Wed Mar 31 16:53:53 EDT 2021


Hi, Antony. failure-timeout should be a resource meta attribute, not an
attribute of the monitor operation. At least I'm not aware of it being
configurable per-operation -- maybe it is. Can't check at the moment :)

On Wednesday, March 31, 2021, Antony Stone <Antony.Stone at ha.open.source.it>
wrote:
> Hi.
>
> I've pared my configuration down to almost a bare minimum to demonstrate
> the problem I'm having.
>
> I have two questions:
>
> 1. What command can I use to find out what pacemaker thinks my
> cluster.cib file really means?
>
> I know what I put in it, but I want to see what pacemaker has understood
> from it, to make sure that pacemaker has the same idea about how to
> manage my resources as I do.
>
>
> 2. Can anyone tell me what the problem is with the following cluster.cib
> (lines split on spaces to make things more readable; the actual file
> consists of four lines of text):
>
> primitive IP-float4
>         IPaddr2
>         params
>         ip=10.1.0.5
>         cidr_netmask=24
>         meta
>         migration-threshold=3
>         op
>         monitor
>         interval=10
>         timeout=30
>         on-fail=restart
>         failure-timeout=180
> primitive IPsecVPN
>         lsb:ipsecwrapper
>         meta
>         migration-threshold=3
>         op
>         monitor
>         interval=10
>         timeout=30
>         on-fail=restart
>         failure-timeout=180
> group Everything
>         IP-float4
>         IPsecVPN
>         resource-stickiness=100
> property cib-bootstrap-options:
>         stonith-enabled=no
>         no-quorum-policy=stop
>         start-failure-is-fatal=false
>         cluster-recheck-interval=60s
>
> My problem is that "failure-timeout" is not being honoured.  A resource
> failure simply never times out, and 3 failures (over a fortnight, if
> that's how long it takes to get 3 failures) mean that the resources move.
>
> I want a failure to be forgotten about after 180 seconds (or at least,
> soon after that - 240 seconds would be fine, if cluster-recheck-interval
> means that 180 can't quite be achieved).
>
> Somehow or other, _far_ more than 180 seconds go by, and I *still* have:
>
>         fail-count=1 last-failure='Wed Mar 31 21:23:11 2021'
>
> as part of the output of "crm status -f" (the above timestamp is BST, so
> that's 70 minutes ago now).
>
>
> Thanks for any help,
>
>
> Antony.
>
> --
> Don't procrastinate - put it off until tomorrow.
>
>                                                    Please reply to the list;
>                                                          please *don't* CC me.
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>

-- 
Regards,

Reid Wahl, RHCA
Senior Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA

