[ClusterLabs] temporary loss of quorum when member starts to rejoin

Jan Friesse jfriesse at redhat.com
Thu Apr 9 06:37:09 EDT 2020


Sherrard Burton wrote:
> 
> 
> On 4/7/20 4:09 AM, Jan Friesse wrote:
>> Sherrard and Andrei
>>>
>>>
>>> On 4/6/20 4:10 PM, Andrei Borzenkov wrote:
>>>> 06.04.2020 20:57, Sherrard Burton wrote:
>>>>
>>>> It looks like a timing issue or race condition. After the reboot, the
>>>> node manages to contact qnetd first, before the connection to the other
>>>> node is established. Qnetd behaves as documented: it sees two equal-size
>>>> partitions and favors the partition that includes the tie breaker
>>>> (lowest node id). So the existing node goes out of quorum. A second
>>>> later both nodes see each other, and quorum is regained.
>>
>> Nice catch
>>
>>>
>>>
>>> thank you for taking the time to trawl through my debugging output. 
>>> your explanation seems to accurately describe what i am experiencing. 
>>> of course i have no idea how to remedy it. :-)
>>
>> It really is quite a problem. Honestly, I don't think there is any 
>> way to remedy this behavior other than implementing an option to 
>> prefer the active partition as a tie-breaker 
>> (https://github.com/corosync/corosync-qdevice/issues/7).
> 
> Jan,
> my curiosity got the best of me, so i spent some time trying to orient 
> myself to the inner workings of ffsplit.
> 
> a) how would one identify the current active partition? i might be 

This is not tracked yet. The first thing to do would be to add the last 
known membership to the qnetd cluster structure.
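
To make the idea concrete, here is a rough sketch in Python (not the C code of qnetd) of how a tie-break between two equal-size partitions could change once the last known membership is available. All names here (Partition, pick_lowest_id, pick_prefer_active, last_known_membership) are hypothetical and are not part of corosync-qdevice; the sketch models only the equal-size case from this thread, assuming the rebooted node holds the lowest node id.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    # Node ids that currently see each other (hypothetical model type).
    node_ids: frozenset


def pick_lowest_id(a: Partition, b: Partition) -> Partition:
    """Tie-break as described in the thread: the partition that
    contains the lowest node id wins."""
    return a if min(a.node_ids) < min(b.node_ids) else b


def pick_prefer_active(a: Partition, b: Partition,
                       last_known_membership: frozenset) -> Partition:
    """Hypothetical alternative: prefer the partition that overlaps most
    with the last known membership; fall back to lowest node id."""
    overlap_a = len(a.node_ids & last_known_membership)
    overlap_b = len(b.node_ids & last_known_membership)
    if overlap_a != overlap_b:
        return a if overlap_a > overlap_b else b
    return pick_lowest_id(a, b)


# Scenario from the thread: node 1 rebooted and reaches qnetd before it
# sees node 2, which stayed up and was the active partition.
rebooted = Partition(frozenset({1}))
survivor = Partition(frozenset({2}))
last_known = frozenset({2})  # assumed: membership recorded before the rejoin

winner_now = pick_lowest_id(rebooted, survivor)
winner_proposed = pick_prefer_active(rebooted, survivor, last_known)
```

With the lowest-id rule the rebooted node wins and the surviving node briefly loses quorum; with the assumed prefer-active rule the surviving node keeps it.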

> starting too far in (or missing something), but it seems that by the 
> time we are in qnetd_algo_ffsplit_partition_cmp(), we are comparing two 
> sets of clients and node lists without the kind of context that would 
> allow us to identify the current active partition. i could not easily 
> identify the object that we would interrogate to answer that question.
> 
> b) is it possible to manage client->tie_breaker.mode and 
> client->tie_breaker.node_id dynamically to achieve the desired goal? ie, 
> if we are in a two-node cluster and one node leaves, can we "push" 
> values to the remaining client such that client->tie_breaker.mode == 
> TLV_TIE_BREAKER_MODE_NODE_ID and client->tie_breaker.node_id == 
> client->node_id?

That would be possible, and it would probably work quite well for a 
2-node cluster. But imagine a cluster with more nodes - this is where 
things become interesting.
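
A minimal Python sketch of the "push the tie-breaker to the remaining node" idea follows. It shows one assumed reading of why larger clusters are harder: with a single survivor the node id to pin is obvious, while with several survivors there is no single obvious choice. The function name pinned_tie_breaker is hypothetical and does not exist in corosync-qdevice.

```python
from typing import Optional, Set


def pinned_tie_breaker(remaining_node_ids: Set[int]) -> Optional[int]:
    """Return the node id to pin as tie-breaker, or None when the
    choice is ambiguous (assumption: more than one node remains)."""
    if len(remaining_node_ids) == 1:
        # Two-node cluster after one node left: pin the survivor.
        return next(iter(remaining_node_ids))
    # Several survivors could later split again, so which id to pin
    # is a policy question this sketch does not try to answer.
    return None


two_node_case = pinned_tie_breaker({2})
larger_case = pinned_tie_breaker({1, 2, 3})
```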

Regards,
   Honza

> 
> of course i may be way off base with all of this. just wanted to ask 
> before i extracted myself from the rabbit hole.
> 


