[ClusterLabs] failure-timeout not working in corosync 2.0.1

Antony Stone Antony.Stone at ha.open.source.it
Wed Mar 31 17:52:42 EDT 2021


On Wednesday 31 March 2021 at 23:09:38, Antony Stone wrote:

> On Wednesday 31 March 2021 at 22:53:53, Reid Wahl wrote:
> > Hi, Antony. failure-timeout should be a resource meta attribute, not an
> > attribute of the monitor operation. At least I'm not aware of it being
> > configurable per-operation -- maybe it is. Can't check at the moment :)
> 
> Okay, I'll try moving it - but that still leaves me wondering why it works
> fine in pacemaker 1.1.16 and not in 2.0.1.

*Thank you, Reid*

	It works.

Moving the failure-timeout specification to the "meta" section of the resource 
definition has caused the failures to disappear from "crm status -f" after the 
expected amount of time.
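
For anyone following along, here is a rough sketch of the change in crm shell
syntax (the resource name, agent and values are just examples, not my real
configuration):

    primitive Example ocf:heartbeat:Dummy \
        op monitor interval=10s timeout=20s \
        meta failure-timeout=120s migration-threshold=3

i.e. failure-timeout now lives under "meta", rather than as an attribute of
the "op monitor" line, which is where I had it before.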

I am sure that this also means the resources are no longer going to move from 
node 1 to node 2 to node 3 and then get totally stuck.

I shall find out for sure by tomorrow (it's nearly midnight where I am now).

I already know what I need to do to stop this particular resource from having 
to be restarted so frequently, but the fact that the 2.0.1 cluster couldn't 
cope with it at all made me nervous about simply doing that, since I would 
then never be confident that the cluster _could_ cope if a resource really did 
need to be restarted several times.

Pacemaker 1.1.16 could cope with the configuration fine, even though I was 
clearly putting failure-timeout into the wrong place in cluster.cib.

Once again, thank you Reid.


Antony.

-- 
What do you get when you cross a joke with a rhetorical question?

                                                   Please reply to the list;
                                                         please *don't* CC me.

