[ClusterLabs] short circuiting the corosync token timeout

Chris Walker cwalker at cray.com
Sat Aug 11 03:05:41 UTC 2018


Hello,

Before Pacemaker can declare a node as 'offline', the Corosync layer 
must first declare that the node is no longer part of the cluster after 
waiting a full token timeout.  For example, if I manually STONITH a node 
with 'crm -F node fence node2', even if the fence operation happens 
immediately, Corosync will still wait the full token timeout before 
communicating to Pacemaker that node2 is offline.
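
To put a number on "a full token timeout", here is a minimal sketch, assuming the formula documented in corosync.conf(5): when a nodelist with at least 3 nodes is configured, the timeout actually used is token + (number_of_nodes - 2) * token_coefficient. The function name and the sample values are only illustrative; the real values come from the totem section of corosync.conf.

```python
def effective_token_ms(token_ms: int, token_coefficient_ms: int, nodes: int) -> int:
    """Estimate the token timeout (ms) Corosync uses for a given cluster size.

    Assumption: follows the corosync.conf(5) description, where
    token_coefficient only applies when a nodelist with at least
    3 nodes is configured.
    """
    if nodes >= 3:
        return token_ms + (nodes - 2) * token_coefficient_ms
    return token_ms


if __name__ == "__main__":
    # Illustrative values: token=1000 ms, token_coefficient=650 ms, 4 nodes.
    print(effective_token_ms(1000, 650, 4))  # 1000 + 2 * 650 = 2300
```

So with these sample values, a 4-node cluster waits at least 2.3 seconds after the fence completes before the membership change starts to propagate.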

There are scenarios where it would be advantageous to short-circuit the 
Corosync token timeout because we already know that a node is offline.  
For example, a node that crashes and dumps a vmcore sends out packets 
indicating that it's safely offline, or a node is physically removed 
from a chassis and an event is sent indicating that the node is 
physically gone.  In these cases, there's no need to wait the full token 
timeout; it would be best to declare the node unclean, STONITH it, and 
move resources.

Has anyone dealt with a scenario like this?  I have a version of 
Corosync with a parameter that effectively expires the token and forces 
the cluster to reconfigure, but this seems a bit heavy-handed, and I'm 
wondering if there's a better way of going about this.

Thanks!
Chris

