[ClusterLabs] temporary loss of quorum when member starts to rejoin

Sherrard Burton sb-clusterlabs at allafrica.com
Tue Apr 7 09:21:50 EDT 2020



On 4/7/20 4:09 AM, Jan Friesse wrote:
> Sherrard and Andrei
> 
>>
>>
>> On 4/6/20 4:10 PM, Andrei Borzenkov wrote:
>>> 06.04.2020 20:57, Sherrard Burton wrote:
>>>>
>>>>
>>>> On 4/6/20 1:20 PM, Sherrard Burton wrote:
>>>>>
>>>>>
>>>>> On 4/6/20 12:35 PM, Andrei Borzenkov wrote:
>>>>>> 06.04.2020 17:05, Sherrard Burton wrote:
>>>>>>>
>>>>>>> from the quorum node:
>>>>> ...
>>>>>>> Apr 05 23:10:17 debug   Client ::ffff:192.168.250.50:54462 (cluster
>>>>>>> xen-nfs01_xen-nfs02, node_id 1) sent quorum node list.
>>>>>>> Apr 05 23:10:17 debug     msg seq num = 6
>>>>>>> Apr 05 23:10:17 debug     quorate = 0
>>>>>>> Apr 05 23:10:17 debug     node list:
>>>>>>> Apr 05 23:10:17 debug       node_id = 1, data_center_id = 0, node_state = member
>>>>>>
>>>>>> Oops. How come the node that was rebooted formed a cluster all by
>>>>>> itself, without seeing the second node? Do you have two_node and/or
>>>>>> wait_for_all configured?
>>>>>>
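
(For context: two_node and wait_for_all are votequorum options set in the
quorum section of corosync.conf. A minimal stanza enabling them, with
purely illustrative values, would look like this:

    quorum {
        provider: corosync_votequorum
        two_node: 1
        wait_for_all: 1
    }

Note that two_node: 1 implicitly enables wait_for_all, and two_node is
normally not combined with a quorum device.)
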
>>>>
>>>> i never thought to check the logs on the rebooted server. hopefully
>>>> someone can extract some further useful information here:
>>>>
>>>>
>>>> https://pastebin.com/imnYKBMN
>>>>
>>>
>>> It looks like a timing issue or race condition. After the reboot, the
>>> node manages to contact qnetd first, before the connection to the other
>>> node is established. Qnetd behaves as documented: it sees two equal-size
>>> partitions and favors the partition that includes the tie-breaker
>>> (lowest node id). So the existing node goes out of quorum. A second
>>> later both nodes see each other, and quorum is regained.
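
(In rough terms, the fifty-fifty-split tie-breaker Andrei describes
reduces to something like the following Python sketch. This is not the
corosync-qnetd source, just the decision rule as described above:

    # Vote decision on a two-way split, per the description above:
    # the larger partition wins; on an equal split, the partition
    # containing the tie-breaker (lowest node id) wins.
    def grant_vote(part_a, part_b):
        if len(part_a) != len(part_b):
            return max(part_a, part_b, key=len)
        return part_a if min(part_a) < min(part_b) else part_b

    # The race in this thread: the rebooted node (id 1) briefly appears
    # as a one-node partition before it sees node 2, so qnetd compares
    # {1} against {2} and the tie-breaker hands the vote to node 1.
    print(grant_vote({1}, {2}))  # -> {1}; node 2 drops out of quorum
)
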
> 
> Nice catch
> 
>>
>>
>> thank you for taking the time to trawl through my debugging output.
>> your explanation seems to accurately describe what i am experiencing.
>> of course i have no idea how to remedy it. :-)
> 
> It is really quite a problem. Honestly, I don't think there is a way to 
> remedy this behavior other than implementing an option to prefer the 
> active partition as the tie-breaker 
> (https://github.com/corosync/corosync-qdevice/issues/7).
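
(To make the proposed option concrete: a hypothetical variant of the
sketch above that keeps the vote with the partition already holding it,
falling back to the lowest node id, might look like this. Illustrative
only; the names here are invented, not taken from issue #7:

    # Hypothetical "prefer active partition" tie-breaker; 'voted' is the
    # partition that held the qdevice vote before the split. Not actual
    # corosync-qdevice code.
    def grant_vote_prefer_active(part_a, part_b, voted):
        if len(part_a) != len(part_b):
            return max(part_a, part_b, key=len)
        if part_a <= voted:
            return part_a   # keep the vote where it already is
        if part_b <= voted:
            return part_b
        return part_a if min(part_a) < min(part_b) else part_b

    # Same split as before, but node 2 (which held the vote) stays quorate:
    print(grant_vote_prefer_active({1}, {2}, voted={2}))  # -> {2}
)
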
> 
> 
>>
>>>
>>> I cannot reproduce it, but I also do not use knet. From the
>>> documentation I have the impression that knet has an artificial delay
>>> before it considers links operational, so maybe that is the reason.
>>
>> i will do some reading on how knet factors into all of this and 
>> respond with any questions or discoveries.
> 
> knet_pong_count/knet_ping_interval tuning may help, but I don't think 
> there is really a way to prevent the creation of a single-node membership 
> in all possible cases.
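
(For reference, these are totem options in corosync.conf(5); knet marks a
link operational only after knet_pong_count ping replies, so shrinking the
interval and count narrows the window in which a booting node can reach
qnetd but not its peer. A hedged example, with illustrative values rather
than recommendations:

    totem {
        # knet link heartbeat tuning; times are in milliseconds
        knet_ping_interval: 200
        knet_ping_timeout: 500
        knet_pong_count: 1
    }
)
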

yes. in my limited thinking about it, i keep coming back around to that 
conclusion in the two-node + qdevice case, barring implementation of #7.


> 
>>
>>>
>>>>>
>>>>> BTW, great eyes. i had not picked up on that little nuance. i had
>>>>> pored through this particular log a number of times, but it was very
>>>>> hard for me to discern the starting and stopping points for each
>>>>> logical group of messages. the indentation made some of it clear. but
>>>>> when you have a series of lines beginning in the left-most column, it
>>>>> is not clear whether they belong to the previous group, the next
>>>>> group, or are their own group.
>>>>>
>>>>> just wanted to note my confusion in case the relevant maintainer
>>>>> happens across this thread.
> 
> Here :)
> 
> The output (especially the debug output) is really a bit cryptic, but I'm 
> not entirely sure how to make it better. Qnetd events have no strict 
> ordering, so I don't see a way to group related events without some kind 
> of reordering and best-guessing, which I'm not too keen to do. Also, some 
> of the messages relate to specific nodes and some relate to the whole 
> cluster (or part of the cluster).
> 
> Of course, I'm open to ideas on how to structure it in a better way.

i wish i were well-versed enough in this particular codebase to submit a 
PR. i think that some kind of tagging indicating whether messages are 
node-specific or cluster-specific would probably help a bit. but 
ultimately it is probably not worth the effort of changing the code, as 
long as the relevant parties can easily analyze the output.
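
for illustration, taking two of the debug lines quoted earlier, the kind 
of scope tag i have in mind might look something like this (a mock-up 
only, not current corosync-qnetd output):

    Apr 05 23:10:17 debug [node 1]   Client ::ffff:192.168.250.50:54462 sent quorum node list.
    Apr 05 23:10:17 debug [cluster]  quorate = 0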

> 
> Regards,
>    Honza

