[ClusterLabs] recommendations for corosync totem timeout for CentOS 7 + VMware?
Brian Reichert
reichert at numachi.com
Thu Mar 21 12:21:46 EDT 2019
I've followed several tutorials about setting up a simple three-node
cluster, with no resources (yet), under CentOS 7.
I've discovered that the cluster stack won't come back up on a node
after that node reboots. The other two nodes, however, do claim the
cluster is up, as shown by 'pcs status cluster'.
I tracked down that, on the rebooted node, corosync had exited with
a '0' status. Nothing in the logs looks like an outright error
message, but this was recorded:
[MAIN ] Corosync main process was not scheduled for 2145.7053
ms (threshold is 1320.0000 ms). Consider token timeout increase.
This seems related:
https://access.redhat.com/solutions/1217663
High Availability cluster node logs the message "Corosync main
process was not scheduled for X ms (threshold is Y ms). Consider
token timeout increase."
I've confirmed that corosync is running with the maximum realtime
scheduling priority:
[root@node1 ~]# ps -eo cmd,rtprio | grep -e [c]orosync -e RTPRIO
CMD RTPRIO
corosync 99
I am doing my testing in an admittedly underprovisioned VM environment.
I've used this same environment for CentOS 6 / heartbeat-based
solutions, and they were nowhere near as sensitive to these timing
issues.
Manually running 'pcs cluster start' does indeed fire everything
up without a hitch, and the cluster then stays up for days at a
stretch.
The 'consider token timeout increase' message has me looking at this:
https://access.redhat.com/solutions/221263
Which makes this assertion:
RHEL 7 or 8
If no token value is specified in the corosync configuration, the
default is 1000 ms (1 second) for a 2-node cluster, increasing by
650 ms for each additional member.
I have a three-node cluster, and the arithmetic for totem.token
seems to hold:
[root@node3 ~]# corosync-cmapctl | grep totem.token
runtime.config.totem.token (u32) = 1650
runtime.config.totem.token_retransmit (u32) = 392
runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
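That is, assuming the formula from Red Hat's page:

    1000 ms + (3 - 2) * 650 ms = 1650 ms

which matches the runtime value above.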
I'm confused about a number of issues:
- The 'totem.token' value of 1650 doesn't seem to be related to the
threshold in the diagnostic message the corosync service logged:
threshold is 1320.0000 ms
Can someone explain the relationship between these values? (I do
notice that 1320 is exactly 80% of 1650, but I don't know whether
that's the actual rule.)
- If I manually set 'totem.token' to a higher value, am I then
responsible for tracking the number of nodes in the cluster, to
stay in alignment with what Red Hat's page says? (See the
corosync.conf sketch below for what I have in mind.)
- Under these conditions, when corosync exits, why does it do so
with a zero status? It seems to me that if it exited at all,
without someone deliberately stopping the service, that warrants a
non-zero status.
- Is there a recommended way to alter either the pacemaker/corosync
or systemd configuration of these services to harden them against
resource starvation? (A possible systemd drop-in is sketched below.)
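On the second question, my reading of corosync.conf(5) is that
token_coefficient (default 650 ms) already provides the per-node
scaling when a nodelist is present, so raising the base token
shouldn't oblige me to track membership by hand. A minimal,
untested sketch of the relevant part of /etc/corosync/corosync.conf
(the 3000 ms value is just a guess on my part, not a
recommendation):

totem {
    # Per corosync.conf(5), with a nodelist of 3+ nodes the effective
    # timeout is: token + (number_of_nodes - 2) * token_coefficient,
    # so this should scale automatically as nodes are added.
    token: 3000              # untested guess at a base value
    token_coefficient: 650   # the documented default, shown for clarity
}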
I don't know if corosync's startup can be deferred until the CPU
load settles, or if some automatic retry can be set up...
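The best workaround I've come up with so far is a systemd drop-in.
Since corosync exits 0 in this failure mode, Restart=on-failure
would never trigger, so this untested sketch uses Restart=always
plus a delay, in the hope that boot-time CPU load has settled by
the time it retries:

# /etc/systemd/system/corosync.service.d/override.conf
[Service]
# corosync exited with status 0 here, so on-failure would not fire
Restart=always
RestartSec=30

(followed by 'systemctl daemon-reload'). I'd welcome opinions on
whether that's sane, or whether it just papers over the scheduling
problem.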
Details of my environment follow; I'm happy to provide more if
anyone has specific questions:
[root@node1 ~]# cat /etc/centos-release
CentOS Linux release 7.6.1810 (Core)
[root@node1 ~]# rpm -qa | egrep 'pacemaker|corosync'
corosynclib-2.4.3-4.el7.x86_64
pacemaker-cluster-libs-1.1.19-8.el7_6.4.x86_64
corosync-2.4.3-4.el7.x86_64
pacemaker-cli-1.1.19-8.el7_6.4.x86_64
pacemaker-1.1.19-8.el7_6.4.x86_64
pacemaker-libs-1.1.19-8.el7_6.4.x86_64
--
Brian Reichert <reichert at numachi.com>
BSD admin/developer at large