[ClusterLabs] recommendations for corosync totem timeout for CentOS 7 + VMware?

Jan Friesse jfriesse at redhat.com
Fri Mar 22 03:57:20 EDT 2019


Brian,

> I've followed several tutorials about setting up a simple three-node
> cluster, with no resources (yet), under CentOS 7.
> 
> I've discovered the cluster won't restart upon rebooting a node.
> 
> The other two nodes, however, do claim the cluster is up, as shown
> with 'pcs status cluster'.
> 
> I tracked down that on the rebooted node, corosync exited with a
> '0' status.  Nothing logged seems to be what I would call an outright
> error message, but this was recorded:
> 
>    [MAIN  ] Corosync main process was not scheduled for 2145.7053
>    ms (threshold is 1320.0000 ms). Consider token timeout increase.
> 
> This seems related:
> 
>    https://access.redhat.com/solutions/1217663
> 
>    High Availability cluster node logs the message "Corosync main
>    process was not scheduled for X ms (threshold is Y ms). Consider
>    token timeout increase."
> 
> I've confirmed that corosync is running with the maximum realtime
> scheduling priority:
> 
>    [root@node1 ~]# ps -eo cmd,rtprio | grep -e [c]orosync -e RTPRIO
>    CMD                         RTPRIO
>    corosync                        99
> 
> I am doing my testing in an admittedly underprovisioned VM environment.
> 
> I've used this same environment for CentOS 6 / heartbeat-based
> solutions, and they were nowhere near as sensitive to these timing
> issues.
> 
> Manually running 'pcs cluster start' does indeed fire everything
> up without a hitch, and it remains running for days at a stretch.
> 
> The 'consider token timeout increase' message has me looking at this:
> 
>    https://access.redhat.com/solutions/221263
> 
> Which makes this assertion:
> 
>    RHEL 7 or 8
> 
>    If no token value is specified in the corosync configuration, the
>    default is 1000 ms, or 1 second, for a 2-node cluster, increasing
>    by 650 ms for each additional member.
> 
> I have a three-node cluster, and the arithmetic for totem.token
> seems to hold:
> 
>    [root@node3 ~]# corosync-cmapctl | grep totem.token
>    runtime.config.totem.token (u32) = 1650
>    runtime.config.totem.token_retransmit (u32) = 392
>    runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
> 
> I'm confused on a number of issues:
> 
> - The 'totem.token' value of 1650 doesn't seem to be related to the
>    threshold number in the diagnostic message the corosync service
>    logged:
> 
>      threshold is 1320.0000 ms
> 
>    Can someone explain the relationship between these values?

Yes. The threshold is 80% of the token timeout in use.
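
For example, with your runtime token timeout of 1650 ms:

   1650 * 0.80 = 1320 ms

which is exactly the threshold printed in the log message.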

> 
> - If I manually set 'totem.token' to a higher value, am I responsible
>    for tracking the number of nodes in the cluster, to keep it in
>    alignment with what Red Hat's page says?

Nope. I've tried to explain what is really happening in the 
corosync.conf(5) man page: totem.token and totem.token_coefficient are 
combined in the following formula:

runtime.config.token = totem.token + (number_of_nodes - 2) *
totem.token_coefficient

Corosync uses runtime.config.token as the effective token timeout.
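
With the RHEL 7 defaults (totem.token = 1000 ms and 
totem.token_coefficient = 650 ms) and your three nodes, that works out 
to:

   runtime.config.token = 1000 + (3 - 2) * 650 = 1650 ms

which matches what corosync-cmapctl reported.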

> 
> - Under these conditions, when corosync exits, why does it do so
>    with a zero status? It seems to me that if it exited at all,

That's a good question. How reproducible is the issue? Corosync 
shouldn't exit with a zero status in that situation.

>    without someone deliberately stopping the service, it warrants a
>    non-zero status.
> 
> - Is there a recommended way to alter either pacemaker/corosync or
>    systemd configuration of these services to harden against resource
>    issues?

Enlarging the token timeout seems like the right way to go.
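
For example, a minimal sketch of raising the base token timeout in 
/etc/corosync/corosync.conf (the 5000 ms value is only an illustration; 
pick something suited to how overcommitted your VMs are):

   totem {
       version: 2
       # "mycluster" is a placeholder; keep your existing cluster_name
       # and other totem settings as they are
       cluster_name: mycluster
       # raised base token timeout; the token_coefficient scaling on
       # top of this still applies
       token: 5000
   }

The file has to match on every node (e.g. 'pcs cluster sync'), and 
corosync has to be restarted for the change to take effect.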

> 
>    I don't know if corosync's startup can be deferred until the CPU
>    load settles, or if some automatic retry can be set up...

This seems more like an init system question.
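
That said, if you want systemd itself to retry corosync after an 
unexpected exit, a drop-in override is one possibility. A minimal 
sketch (the path and values are illustrative, not something the 
packages ship):

   # /etc/systemd/system/corosync.service.d/restart.conf
   [Service]
   Restart=on-failure
   RestartSec=10

followed by 'systemctl daemon-reload'. Note that Restart=on-failure 
only fires on an unclean exit, so it won't help as long as corosync 
really exits with status 0.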

Regards,
   Honza

> 
> Details of my environment; I'm happy to provide others, if anyone
> has any specific questions:
> 
>    [root@node1 ~]# cat /etc/centos-release
>    CentOS Linux release 7.6.1810 (Core)
>    [root@node1 ~]# rpm -qa | egrep 'pacemaker|corosync'
>    corosynclib-2.4.3-4.el7.x86_64
>    pacemaker-cluster-libs-1.1.19-8.el7_6.4.x86_64
>    corosync-2.4.3-4.el7.x86_64
>    pacemaker-cli-1.1.19-8.el7_6.4.x86_64
>    pacemaker-1.1.19-8.el7_6.4.x86_64
>    pacemaker-libs-1.1.19-8.el7_6.4.x86_64
> 


