[ClusterLabs] short circuiting the corosync token timeout
ccaulfie at redhat.com
Mon Aug 13 08:42:47 UTC 2018
On 13/08/18 09:00, Jan Friesse wrote:
> Chris Walker wrote:
>> Before Pacemaker can declare a node as 'offline', the Corosync layer
>> must first declare that the node is no longer part of the cluster
>> after waiting a full token timeout. For example, if I manually
>> STONITH a node with 'crm -F node fence node2', even if the fence
>> operation happens immediately, Corosync will still wait the full token
>> timeout before communicating to Pacemaker that node2 is offline.
>> There are scenarios where it would be advantageous to short circuit
>> the Corosync token timeout since we know that a node is offline. For
>> example, if a node crashes and dumps a vmcore, it sends out packets
>> indicating that it's safely offline. Or if a node is physically
>> removed from a chassis and an event is sent indicating that the node
>> is physically gone. In these cases, there's no need to wait the full
>> token timeout; it would be best to declare the node unclean, STONITH
>> it, and move resources.
>> Has anyone dealt with a scenario like this? I have a version of
>> Corosync with a parameter that effectively expires the token and
>> forces the cluster to reconfigure, but this seems a bit heavy handed
>> and I'm wondering if there's a better way of going about this.
> I'm not aware of such functionality. The closest you can get right now
> is to shut down one of the nodes cleanly; this will force corosync to
> create a new membership.
> Anyway, I've filed a GH issue
I'm intrigued as to why the token timeout is so long that it's quicker
to do a manual intervention than simply wait for it to expire?
Some of the earlier implementations of qdiskd and multipath required
long timeouts (though still only in the realm of 30 to 60 seconds) but I
thought even those had been fixed.
Bear in mind that this is a potentially dangerous operation, so any
'official' implementation would require the user to confirm their
intention, making it an even more time-consuming process.