[ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

Mon Mar 1 08:45:04 EST 2021

Andrei,

> On 01.03.2021 15:45, Jan Friesse wrote:
>> Andrei,
>>
>>> On 01.03.2021 12:26, Jan Friesse wrote:
>>>>>
>>>>
>>>> Thanks for digging into logs. I believe Eric is hitting
>>>> https://github.com/corosync/corosync-qdevice/issues/10 (already fixed,
>>>> but may take some time to get into distributions) - it also contains
>>>> workaround.
>>>>
>>>
>>> I tested corosync-qnetd at df3c672 which should include these fixes. It
>>> changed behavior, still I cannot explain it.
>>>
>>> Again, ha1+ha2+qnetd, ha2 is current DC, I disconnect ha1 (block
>>> everything with ha1 source MAC), stonith disabled. corosync and
>>
>> So ha1 is blocked on both ha2 and qnetd and blocking is symmetric (I
>> mean, nothing is sent to ha1 and nothing is received from ha1)?
>>
> 
> No, it is asymmetric. ha1 cannot *send* anything to ha2 or qnetd; it
> should be able to *receive* from both.

That's a problem for corosync 2.x. Corosync 3.x with knet solves this by
establishing a connection only when a node can both send packets to and
receive packets from other nodes, but udpu behavior is weird (on the
corosync side) when it is possible to receive a message but not send one
(or vice versa).

It also explains why there are multiple "waiting for qdevice" messages 
logged.

Could you please try to block both outgoing and incoming packets?
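
To make the difference between the two tests concrete, here is a minimal
sketch in Python. It is only an illustrative model, not corosync or knet
code: the function names, the set-of-pairs representation of blocked
traffic, and the treatment of qnetd as just another endpoint are all
assumptions made for the example (in reality qdevice talks to qnetd over
its own connection). It shows which pairs have a usable two-way link and
which pairs are left in the ambiguous one-way state.

```python
# Hypothetical model for illustration only; none of these names exist in corosync.
from itertools import combinations

NODES = ["ha1", "ha2", "qnetd"]


def can_send(src, dst, blocked):
    """True if packets from src reach dst; blocked is a set of (src, dst) pairs."""
    return (src, dst) not in blocked


def bidirectional_links(nodes, blocked):
    """Pairs where both directions work (the rule described above for knet)."""
    return {
        (a, b)
        for a, b in combinations(nodes, 2)
        if can_send(a, b, blocked) and can_send(b, a, blocked)
    }


def one_way_links(nodes, blocked):
    """Pairs where exactly one direction works: the problematic case for udpu."""
    return {
        (a, b)
        for a, b in combinations(nodes, 2)
        if can_send(a, b, blocked) != can_send(b, a, blocked)
    }


# Asymmetric test from this thread: ha1 cannot send, but can still receive.
asymmetric = {("ha1", "ha2"), ("ha1", "qnetd")}
# Symmetric test: additionally drop everything addressed to ha1.
symmetric = asymmetric | {("ha2", "ha1"), ("qnetd", "ha1")}

print("asymmetric, two-way links:", sorted(bidirectional_links(NODES, asymmetric)))
print("asymmetric, one-way links:", sorted(one_way_links(NODES, asymmetric)))
print("symmetric,  one-way links:", sorted(one_way_links(NODES, symmetric)))
```

In this model the symmetric block leaves no one-way pairs at all, which
is the situation the requested test is meant to produce.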

> 
>>> corosync-qdevice on nodes are still 2.4.5 if it matters.
>>
>> Shouldn't really matter as long as both corosync-qdevice and
>> corosync-qnetd are version 3.0.1.
>>
> 
> corosync-qdevice on nodes is still 2.4.5. corosync-qnetd on witness is
> git snapshot from last November. I was not sure I could mix corosync and
> corosync-qdevice of different versions and looking at git commit all

It is (or should be) possible. I was testing this scenario (old qnetd +
new qdevice and old qdevice + new qnetd) before releasing 3.0.1 (not
extensively though, so of course there can be some bugs which I haven't
spotted).

> changes seem to be in qnetd anyway.

True

> 
> ...
> 
>>
>> That's a bit harder to explain but it has a reason.
>>
> 
> OK, thank you.
> ...
>>
>> No matter what, are you able to provide some step-by-step reproducer of
>> that 40 sec delay?
> 
> No. As I said next time I tested I got entirely different timing. I will
> try after cold boot again.
> 

Perfect, thanks.

Regards,
   Honza