[ClusterLabs] recommendations for corosync totem timeout for CentOS 7 + VMware?
Jan Friesse
jfriesse at redhat.com
Fri Mar 22 03:57:20 EDT 2019
Brian,
> I've followed several tutorials about setting up a simple three-node
> cluster, with no resources (yet), under CentOS 7.
>
> I've discovered the cluster won't restart upon rebooting a node.
>
> The other two nodes, however, do claim the cluster is up, as shown
> with 'pcs status cluster'.
>
> I tracked down that on the rebooted node, corosync exited with a
> '0' status. Nothing outright seems to be what I would call an error
> message, but this was recorded:
>
> [MAIN ] Corosync main process was not scheduled for 2145.7053
> ms (threshold is 1320.0000 ms). Consider token timeout increase.
>
> This seems related:
>
> https://access.redhat.com/solutions/1217663
>
> High Availability cluster node logs the message "Corosync main
> process was not scheduled for X ms (threshold is Y ms). Consider
> token timeout increase."
>
> I've confirmed that corosync is running with the maximum realtime
> scheduling priority:
>
> [root@node1 ~]# ps -eo cmd,rtprio | grep -e [c]orosync -e RTPRIO
> CMD RTPRIO
> corosync 99
>
> I am doing my testing in an admittedly underprovisioned VM environment.
>
> I've used this same environment for CentOS 6 / heartbeat-based
> solutions, and they were nowhere near as sensitive to these timing
> issues.
>
> Manually running 'pcs cluster start' does indeed fire everything
> up without a hitch, and remains running for days at a crack.
>
> The 'consider token timeout increase' message has me looking at this:
>
> https://access.redhat.com/solutions/221263
>
> Which makes this assertion:
>
> RHEL 7 or 8
>
> If no token value is specified in the corosync configuration, the
> default is 1000 ms, or 1 second for a 2 node cluster, increasing
> by 650ms for each additional member.
>
> I have a three-node cluster, and the arithmetic for totem.token
> seems to hold:
>
> [root@node3 ~]# corosync-cmapctl | grep totem.token
> runtime.config.totem.token (u32) = 1650
> runtime.config.totem.token_retransmit (u32) = 392
> runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
>
> I'm confused on a number of issues:
>
> - The 'totem.token' value of 1650 doesn't seem to be related to the
> threshold number in the diagnostic message the corosync service
> logged:
>
> threshold is 1320.0000 ms
>
> Can someone explain the relationship between these values?
Yes. The threshold is 80% of the token timeout in use.
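As a quick check of that relationship against the numbers quoted above, here is a minimal sketch. The function name is made up for illustration and is not part of corosync; the 80% factor is the one stated above.

```python
def scheduling_threshold_ms(token_timeout_ms: float, percent: int = 80) -> float:
    """Return the 'not scheduled' warning threshold for a token timeout.

    Hypothetical helper: it only restates "threshold is 80% of the
    token timeout in use" as arithmetic.
    """
    return token_timeout_ms * percent / 100


# The three-node cluster above runs with a 1650 ms token timeout.
print(scheduling_threshold_ms(1650))  # 1320.0, matching "threshold is 1320.0000 ms"
```

So the 1320 ms in the log message follows directly from the 1650 ms runtime token value.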
>
> - If I manually set 'totem.token' to a higher value, am I responsible
> for tracking the number of nodes in the cluster, to keep in
> alignment with what Red Hat's page says?
No. I've tried to explain what really happens in the corosync.conf(5)
manpage. totem.token and totem.token_coefficient are used in
the following formula:
runtime.config.totem.token = totem.token + (number_of_nodes - 2) *
totem.token_coefficient
Corosync uses runtime.config.totem.token, so there is no need to
track the node count yourself.
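The formula can be sketched as follows. This is only an illustration, not corosync code: the function name is hypothetical, the defaults (1000 ms token, 650 ms coefficient) are the ones quoted earlier in this thread, and clamping the node term at zero for clusters of fewer than three nodes is my assumption.

```python
def effective_token_ms(token_ms: int = 1000,
                       token_coefficient_ms: int = 650,
                       number_of_nodes: int = 3) -> int:
    """Compute the runtime token timeout from the configured values.

    Hypothetical helper mirroring:
        token + (number_of_nodes - 2) * token_coefficient
    """
    # Assumption: the coefficient adds nothing below three nodes.
    extra_nodes = max(number_of_nodes - 2, 0)
    return token_ms + extra_nodes * token_coefficient_ms


# Defaults on a three-node cluster, as in the corosync-cmapctl output above.
print(effective_token_ms())               # 1650
# Raising totem.token to 3000 ms on the same cluster.
print(effective_token_ms(token_ms=3000))  # 3650
```

In other words, a manually set totem.token is still scaled by the node count, so it does not have to be recalculated when nodes are added.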
>
> - Under these conditions, when corosync exits, why does it do so
> with a zero status? It seems to me that if it exited at all,
> without someone controllably stopping the service, it warrants a
> non-zero status.
That's a good question. How reproducible is the issue? Corosync
shouldn't "exit" with zero status.
>
> - Is there a recommended way to alter either pacemaker/corosync or
> systemd configuration of these services to harden against resource
> issues?
Increasing the token timeout seems like the right way to go.
>
> I don't know if corosync's startup can be deferred until the CPU
> load settles, or if some automatic retry can be set up...
This seems more like an init system question.
Regards,
Honza
>
> Details of my environment; I'm happy to provide others, if anyone
> has any specific questions:
>
> [root@node1 ~]# cat /etc/centos-release
> CentOS Linux release 7.6.1810 (Core)
> [root@node1 ~]# rpm -qa | egrep 'pacemaker|corosync'
> corosynclib-2.4.3-4.el7.x86_64
> pacemaker-cluster-libs-1.1.19-8.el7_6.4.x86_64
> corosync-2.4.3-4.el7.x86_64
> pacemaker-cli-1.1.19-8.el7_6.4.x86_64
> pacemaker-1.1.19-8.el7_6.4.x86_64
> pacemaker-libs-1.1.19-8.el7_6.4.x86_64
>