[ClusterLabs] temporary loss of quorum when member starts to rejoin

Tue Apr 7 00:53:36 EDT 2020

On April 7, 2020 12:21:50 AM GMT+03:00, Sherrard Burton <sb-clusterlabs at allafrica.com> wrote:
>
>
>On 4/6/20 4:10 PM, Andrei Borzenkov wrote:
>> 06.04.2020 20:57, Sherrard Burton wrote:
>>>
>>>
>>> On 4/6/20 1:20 PM, Sherrard Burton wrote:
>>>>
>>>>
>>>> On 4/6/20 12:35 PM, Andrei Borzenkov wrote:
>>>>> 06.04.2020 17:05, Sherrard Burton wrote:
>>>>>>
>>>>>> from the quorum node:
>>>> ...
>>>>>> Apr 05 23:10:17 debug   Client ::ffff:192.168.250.50:54462 (cluster xen-nfs01_xen-nfs02, node_id 1) sent quorum node list.
>>>>>> Apr 05 23:10:17 debug     msg seq num = 6
>>>>>> Apr 05 23:10:17 debug     quorate = 0
>>>>>> Apr 05 23:10:17 debug     node list:
>>>>>> Apr 05 23:10:17 debug       node_id = 1, data_center_id = 0, node_state = member
>>>>>
>>>>> Oops. How come the node that was rebooted formed a cluster all by
>>>>> itself, without seeing the second node? Do you have two_node and/or
>>>>> wait_for_all configured?
>>>>>
>>>
>>> i never thought to check the logs on the rebooted server. hopefully
>>> someone can extract some further useful information here:
>>>
>>>
>>> https://pastebin.com/imnYKBMN
>>>
>> 
>> It looks like a timing issue or race condition. After the reboot, the
>> node manages to contact qnetd first, before the connection to the other
>> node is established. Qnetd behaves as documented: it sees two equal-size
>> partitions and favors the partition that includes the tie breaker
>> (lowest node id). So the existing node goes out of quorum. A second
>> later both nodes see each other and so quorum is regained.
>
>
>thank you for taking the time to trawl through my debugging output.
>your explanation seems to accurately describe what i am experiencing. of
>course i have no idea how to remedy it. :-)
>
>> 
>> I cannot reproduce it, but I also do not use knet. From the
>> documentation I have the impression that knet has an artificial delay
>> before it considers links operational, so maybe that is the reason.
>
>i will do some reading on how knet factors into all of this and respond
>with any questions or discoveries.
>
>> 
>>>>
>>>> BTW, great eyes. i had not picked up on that little nuance. i had
>>>> pored over this particular log a number of times, but it was very
>>>> hard for me to discern the starting and stopping points for each
>>>> logical group of messages. the indentation made some of it clear, but
>>>> when you have a series of lines beginning in the left-most column, it
>>>> is not clear whether they belong to the previous group, the next
>>>> group, or they are their own group.
>>>>
>>>> just wanted to note my confusion in case the relevant maintainer
>>>> happens across this thread.
>>>>
>>>> thanks again
>>>> _______________________________________________
>>>> Manage your subscription:
>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>
>>>> ClusterLabs home: https://www.clusterlabs.org/
>> 

Hi Sherrard,

Have you tried increasing the qnet timers in corosync.conf?

Best Regards,
Strahil Nikolov
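
For readers following the thread, the race Andrei describes above can be illustrated with a small sketch. This is only a toy model of the tie-breaker rule as stated in the thread (equal-size partitions, lowest node id wins), not the actual qnetd implementation; the function name favored_partition and the set-based representation of partitions are assumptions made for illustration.

```python
def favored_partition(partitions, tie_breaker="lowest"):
    """Pick the partition a qnetd-like arbiter would favor (toy model).

    partitions: list of sets of node ids, one set per partition seen.
    tie_breaker: "lowest" or "highest" node id decides between
                 partitions of equal size (hypothetical parameter).
    """
    largest = max(len(p) for p in partitions)
    candidates = [p for p in partitions if len(p) == largest]
    if len(candidates) == 1:
        # One partition is strictly larger: no tie to break.
        return candidates[0]
    # Equal-size partitions: favor the one holding the tie-breaker node.
    all_ids = [node for p in candidates for node in p]
    tb_node = min(all_ids) if tie_breaker == "lowest" else max(all_ids)
    return next(p for p in candidates if tb_node in p)


# Rebooted node 1 reaches the arbiter before it sees node 2:
# two partitions of size one, so node 1 (lowest id) is favored
# and the surviving node 2 temporarily loses quorum.
during_race = favored_partition([{2}, {1}])

# A moment later the nodes see each other and form one partition.
after_join = favored_partition([{1, 2}])
```

In this model the brief loss of quorum on the surviving node follows directly from the rejoining node holding the lowest id; whether longer timers or a different tie breaker avoids it in a real cluster would need to be checked against the corosync-qdevice documentation.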