[ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

Wed Mar 3 12:41:13 EST 2021

On 01.03.2021 16:45, Jan Friesse wrote:
> Andrei,
> 
>> On 01.03.2021 15:45, Jan Friesse wrote:
>>> Andrei,
>>>
>>>> On 01.03.2021 12:26, Jan Friesse wrote:
>>>>>>
>>>>>
>>>>> Thanks for digging into the logs. I believe Eric is hitting
>>>>> https://github.com/corosync/corosync-qdevice/issues/10 (already fixed,
>>>>> but it may take some time to get into distributions) - it also contains
>>>>> a workaround.
>>>>>
>>>>
>>>> I tested corosync-qnetd at df3c672 which should include these fixes. It
>>>> changed behavior, still I cannot explain it.
>>>>
>>>> Again, ha1+ha2+qnetd, ha2 is current DC, I disconnect ha1 (block
>>>> everything with ha1 source MAC), stonith disabled. corosync and
>>>
>>> So ha1 is blocked on both ha2 and qnetd and blocking is symmetric (I
>>> mean, nothing is sent to ha1 and nothing is received from ha1)?
>>>
>>
>> No, it is asymmetric. ha1 cannot *send* anything to ha2 or qnetd; it
>> should be able to *receive* from both.
> 
> That's a problem for corosync 2.x. Corosync 3.x with knet solves this by
> establishing a connection only when a node can both send packets to and
> receive packets from other nodes, but udpu behavior is weird (on the
> corosync side) when it is possible to receive a message but not send one
> (or vice versa).
> 
> It also explains why there are multiple "waiting for qdevice" messages
> logged.
> 
> Could you please try to block both outgoing and incoming packets?
> 

Several times both nodes detected the problem and reacted almost
synchronously, so that was probably it.
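
For reference, a minimal sketch of what symmetric blocking could look like
when run on ha2 and on qnetd. It assumes plain iptables filtering by IP
address rather than the source-MAC match used in the test above, and the
address 192.0.2.11 and the helper name symmetric_block_rules are
hypothetical placeholders. The script only prints the rules; nothing is
applied unless the output is piped to a root shell.

```shell
#!/bin/sh
# Hypothetical address of the node to isolate (ha1); adjust to your setup.
HA1_IP="${HA1_IP:-192.0.2.11}"

# Print iptables rules that drop traffic to and from one peer.
# Nothing is applied here; pipe the output to `sh` as root to apply it.
symmetric_block_rules() {
    peer="$1"
    echo "iptables -I INPUT -s $peer -j DROP"   # drop everything received from the peer
    echo "iptables -I OUTPUT -d $peer -j DROP"  # drop everything sent to the peer
}

symmetric_block_rules "$HA1_IP"
```

Keeping only the INPUT rule should roughly correspond to the asymmetric
case described earlier, where ha1 can still receive but nothing it sends
gets through.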

...

>>>
>>> No matter what, are you able to provide a step-by-step reproducer of
>>> that 40 sec delay?
>>
>> No. As I said, the next time I tested I got entirely different timing.
>> I will try again after a cold boot.
>>
> 
> Perfect, thanks.
> 

I was able to reproduce it again with asymmetric fencing after a cold
boot. Are you still interested?