[ClusterLabs] short circuiting the corosync token timeout
Chris Walker
cwalker at cray.com
Fri Aug 10 23:05:41 EDT 2018
Hello,
Before Pacemaker can declare a node as 'offline', the Corosync layer
must first declare that the node is no longer part of the cluster after
waiting a full token timeout. For example, if I manually STONITH a node
with 'crm -F node fence node2', even if the fence operation happens
immediately, Corosync will still wait the full token timeout before
communicating to Pacemaker that node2 is offline.
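To put a rough number on that wait, here is a minimal sketch of how the effective timeouts could be estimated. It assumes the formulas described in corosync.conf(5): the real token timeout is token + (nodes - 2) * token_coefficient when the nodelist has at least 3 nodes, and consensus defaults to 1.2 * token. The defaults used (token 1000 ms, token_coefficient 650 ms) are assumptions that vary by Corosync version, and the function and field names are hypothetical. Treating token + consensus as the detection delay is only an approximation.

```python
from typing import NamedTuple


class TotemTimeouts(NamedTuple):
    token_ms: int       # effective token timeout after scaling by node count
    consensus_ms: int   # default consensus timeout derived from the token

    @property
    def detection_ms(self) -> int:
        # Rough estimate (assumption): token loss is declared first, then
        # the membership must reach consensus before the node is dropped.
        return self.token_ms + self.consensus_ms


def effective_timeouts(token_ms: int = 1000,
                       nodes: int = 2,
                       token_coefficient_ms: int = 650) -> TotemTimeouts:
    """Estimate totem timeouts from corosync.conf-style settings."""
    # The coefficient only applies when the nodelist has 3 or more nodes.
    extra = (nodes - 2) * token_coefficient_ms if nodes >= 3 else 0
    real_token = token_ms + extra
    # Default consensus is 1.2 * token; integer math avoids float rounding.
    consensus = real_token * 12 // 10
    return TotemTimeouts(real_token, consensus)


if __name__ == "__main__":
    t = effective_timeouts(token_ms=10000, nodes=5)
    print(t.token_ms, t.consensus_ms, t.detection_ms)
```

With a large configured token, even an instantaneous fence still leaves the cluster waiting on the order of these values before Pacemaker sees the node as offline.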
There are scenarios where it would be advantageous to short-circuit the
Corosync token timeout because we already know that a node is offline.
For example, a node that crashes and dumps a vmcore sends out packets
indicating that it's safely offline. Similarly, when a node is physically
removed from a chassis, an event is sent indicating that the node is
gone. In these cases, there's no need to wait the full token
timeout; it would be best to declare the node unclean, STONITH it, and
move resources.
Has anyone dealt with a scenario like this? I have a version of
Corosync with a parameter that effectively expires the token and forces
the cluster to reconfigure, but this seems a bit heavy-handed and I'm
wondering if there's a better way of going about this.
Thanks!
Chris