[ClusterLabs] short circuiting the corosync token timeout

Mon Aug 13 08:00:07 UTC 2018

Chris Walker wrote:
> Hello,
> 
> Before Pacemaker can declare a node as 'offline', the Corosync layer 
> must first declare that the node is no longer part of the cluster after 
> waiting a full token timeout.  For example, if I manually STONITH a node 
> with 'crm -F node fence node2', even if the fence operation happens 
> immediately, Corosync will still wait the full token timeout before 
> communicating to Pacemaker that node2 is offline.
> 
> There are scenarios where it would be advantageous to short circuit the 
> Corosync token timeout since we know that a node is offline. For 
> example, if a node crashes and dumps a vmcore, it sends out packets 
> indicating that it's safely offline.  Or if a node is physically removed 
> from a chassis and an event is sent indicating that the node is 
> physically gone.  In these cases, there's no need to wait the full token 
> timeout; it would be best to declare the node unclean, STONITH it, and 
> move resources.
> 
> Has anyone dealt with a scenario like this?  I have a version of 
> Corosync with a parameter that effectively expires the token and forces 
> the cluster to reconfigure, but this seems a bit heavy-handed and I'm 
> wondering if there's a better way of going about this.

I'm not aware of any such functionality. The closest you can get right 
now is to shut down one of the nodes cleanly; this forces corosync to 
create a new membership.
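
For a rough idea of how long the wait in question is, here is a minimal 
sketch of the effective token timeout as I understand it from 
corosync.conf(5): with a nodelist of at least 3 nodes, the real timeout is 
token + (number_of_nodes - 2) * token_coefficient. The function name and 
the example values are illustrative only, and the 650 ms default for 
token_coefficient is an assumption to check against your version's man page.

```python
def effective_token_timeout_ms(token_ms: int, nodes: int,
                               token_coefficient_ms: int = 650) -> int:
    """Estimate the token timeout corosync actually applies, in milliseconds.

    Assumes a nodelist section is configured. The coefficient is only
    applied when the nodelist contains at least 3 nodes; otherwise the
    configured token value is used unchanged.
    """
    if nodes >= 3:
        return token_ms + (nodes - 2) * token_coefficient_ms
    return token_ms


if __name__ == "__main__":
    # Illustrative values, not taken from any real cluster.
    for n in (2, 3, 5):
        print(n, "nodes:", effective_token_timeout_ms(1000, n), "ms")
```

This covers only the token timeout itself; forming the new membership 
afterwards takes additional time.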

Anyway, I've filed a GitHub issue: https://github.com/corosync/corosync/issues/366

Honza

> 
> Thanks!
> Chris