[ClusterLabs] temporary loss of quorum when member starts to rejoin

Sherrard Burton sb-clusterlabs at allafrica.com
Tue Apr 7 18:48:45 EDT 2020



On 4/7/20 4:09 AM, Jan Friesse wrote:
> Sherrard and Andrei
>>
>>
>> On 4/6/20 4:10 PM, Andrei Borzenkov wrote:
>>> 06.04.2020 20:57, Sherrard Burton wrote:
>>>
>>> It looks like a timing issue or race condition. After reboot, the node
>>> manages to contact qnetd first, before the connection to the other node
>>> is established. Qnetd behaves as documented: it sees two equal-size
>>> partitions and favors the partition that includes the tie breaker
>>> (lowest node id). So the existing node goes out of quorum. A second
>>> later both nodes see each other and quorum is regained.
> 
> Nice catch
> 
>>
>>
>> thank you for taking the time to troll through my debugging output. 
>> your explanation seems to accurately describe what i am experiencing. 
>> of course i have no idea how to remedy it. :-)
> 
> It is really quite a problem. Honestly, I don't think there is a way 
> to remedy this behavior other than implementing an option to prefer 
> the active partition as a tie-breaker 
> (https://github.com/corosync/corosync-qdevice/issues/7).

Jan,
my curiosity got the best of me, so i spent some time trying to orient 
myself to the inner workings of ffsplit.

a) how would one identify the current active partition? i might be 
starting too far in (or missing something), but it seems that by the 
time we are in qnetd_algo_ffsplit_partition_cmp(), we are comparing two 
sets of clients and node lists without the kind of context that would 
allow us to identify the current active partition. i could not easily 
identify the object that we would interrogate to answer that question.
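
to make (a) concrete, here is a rough sketch of the ordering i imagine, 
assuming the comparison had access to some per-partition "was active" 
flag. everything here is hypothetical: struct sketch_partition, its 
fields and sketch_partition_cmp() are stand-ins i made up for 
illustration, not the real qnetd structures or the real signature of 
qnetd_algo_ffsplit_partition_cmp().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one partition; NOT a real qnetd structure. */
struct sketch_partition {
    size_t   node_count;      /* number of member nodes */
    uint32_t lowest_node_id;  /* lowest node id among the members */
    bool     was_active;      /* assumed flag: partition was quorate before the change */
};

/* Returns > 0 if a should win, < 0 if b should win, 0 if undecided. */
static int sketch_partition_cmp(const struct sketch_partition *a,
                                const struct sketch_partition *b)
{
    assert(a != NULL && b != NULL);

    /* A strictly larger partition wins outright. */
    if (a->node_count != b->node_count)
        return (a->node_count > b->node_count) ? 1 : -1;

    /* Equal size: prefer the partition that was already active. */
    if (a->was_active != b->was_active)
        return a->was_active ? 1 : -1;

    /* Otherwise fall back to the lowest-node-id tie breaker. */
    if (a->lowest_node_id != b->lowest_node_id)
        return (a->lowest_node_id < b->lowest_node_id) ? 1 : -1;

    return 0;
}
```

with something like this, a running single-node partition would beat a 
freshly rebooted one of the same size even when the rebooted node has 
the lower node id. the open question is still where the was_active 
information would come from.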

b) is it possible to manage client->tie_breaker.mode and 
client->tie_breaker.node_id dynamically to achieve the desired goal? i.e., 
if we are in a two-node cluster and one node leaves, can we "push" 
values to the remaining client such that client->tie_breaker.mode == 
TLV_TIE_BREAKER_MODE_NODE_ID and client->tie_breaker.node_id == 
client->node_id?
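
in code, the "push" i have in mind for (b) would amount to something 
like the sketch below. the struct layouts, the enum values other than 
TLV_TIE_BREAKER_MODE_NODE_ID, and the helper name are assumptions of 
mine so that the snippet is self-contained; the real definitions live 
in corosync-qdevice and may well differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in definitions for illustration only; not copied from corosync-qdevice. */
enum sketch_tie_breaker_mode {
    TLV_TIE_BREAKER_MODE_LOWEST,   /* assumed name */
    TLV_TIE_BREAKER_MODE_HIGHEST,  /* assumed name */
    TLV_TIE_BREAKER_MODE_NODE_ID
};

struct sketch_tie_breaker {
    enum sketch_tie_breaker_mode mode;
    uint32_t node_id;
};

struct sketch_client {
    uint32_t node_id;
    struct sketch_tie_breaker tie_breaker;
};

/*
 * Hypothetical helper: when the other node of a two-node cluster leaves,
 * pin the tie breaker to the surviving client's own node id.
 */
static void sketch_pin_tie_breaker_to_survivor(struct sketch_client *client)
{
    assert(client != NULL);

    client->tie_breaker.mode = TLV_TIE_BREAKER_MODE_NODE_ID;
    client->tie_breaker.node_id = client->node_id;
}
```

the idea being that when the other node comes back and reaches qnetd 
first, the tie breaker would already point at the surviving node. 
whether qnetd can safely change these values after the client has 
connected is exactly what i am unsure about.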

of course i may be way off base with all of this. just wanted to ask 
before i extracted myself from the rabbit hole.

