[ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

Jan Friesse jfriesse at redhat.com
Mon Mar 1 07:45:58 EST 2021


Andrei,

> On 01.03.2021 12:26, Jan Friesse wrote:
>>>
>>
>> Thanks for digging into logs. I believe Eric is hitting
>> https://github.com/corosync/corosync-qdevice/issues/10 (already fixed,
>> but may take some time to get into distributions) - it also contains
>> workaround.
>>
> 
> I tested corosync-qnetd at df3c672 which should include these fixes. It
> changed behavior, still I cannot explain it.
> 
> Again, ha1+ha2+qnetd, ha2 is current DC, I disconnect ha1 (block
> everything with ha1 source MAC), stonith disabled. corosync and

So ha1 is blocked on both ha2 and qnetd, and the blocking is symmetric
(I mean, nothing is sent to ha1 and nothing is received from ha1)?

> corosync-qdevice on nodes are still 2.4.5 if it matters.

Shouldn't really matter as long as both corosync-qdevice and 
corosync-qnetd are version 3.0.1.

> 
> ha2:
> 
> Mar 01 13:23:27 ha2 corosync[1576]:   [TOTEM ] A processor failed,
> forming new configuration.
> Mar 01 13:23:28 ha2 corosync[1576]:   [VOTEQ ] waiting for quorum device
> Qdevice poll (but maximum for 30000 ms)
> Mar 01 13:23:28 ha2 corosync[1576]:   [TOTEM ] A new membership
> (192.168.1.2:3632) was formed. Members left: 1
> Mar 01 13:23:28 ha2 corosync[1576]:   [TOTEM ] Failed to receive the
> leave message. failed: 1
> Mar 01 13:23:28 ha2 corosync[1576]:   [CPG   ] downlist left_list: 1
> received
> Mar 01 13:23:28 ha2 pacemaker-based[2032]:  notice: Node ha1 state is
> now lost
> Mar 01 13:23:28 ha2 pacemaker-based[2032]:  notice: Purged 1 peer with
> id=1 and/or uname=ha1 from the membership cache
> Mar 01 13:23:28 ha2 pacemaker-attrd[2035]:  notice: Lost attribute
> writer ha1
> Mar 01 13:23:28 ha2 pacemaker-attrd[2035]:  notice: Node ha1 state is
> now lost
> Mar 01 13:23:28 ha2 pacemaker-attrd[2035]:  notice: Removing all ha1
> attributes for peer loss
> Mar 01 13:23:28 ha2 pacemaker-attrd[2035]:  notice: Purged 1 peer with
> id=1 and/or uname=ha1 from the membership cache
> Mar 01 13:23:28 ha2 pacemaker-fenced[2033]:  notice: Node ha1 state is
> now lost
> Mar 01 13:23:28 ha2 pacemaker-fenced[2033]:  notice: Purged 1 peer with
> id=1 and/or uname=ha1 from the membership cache
> Mar 01 13:23:28 ha2 pacemaker-controld[2037]:  warning: Stonith/shutdown
> of node ha1 was not expected
> Mar 01 13:23:28 ha2 pacemaker-controld[2037]:  notice: State transition
> S_IDLE -> S_POLICY_ENGINE
> Mar 01 13:23:33 ha2 pacemaker-controld[2037]:  notice: High CPU load
> detected: 1.200000
> Mar 01 13:23:35 ha2 corosync[1576]:   [QUORUM] Members[1]: 2
> Mar 01 13:23:35 ha2 corosync[1576]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> Mar 01 13:23:35 ha2 pacemaker-attrd[2035]:  notice: Recorded local node
> as attribute writer (was unset)
> Mar 01 13:23:35 ha2 pacemaker-controld[2037]:  notice: Node ha1 state is
> now lost
> Mar 01 13:23:35 ha2 pacemaker-controld[2037]:  warning: Stonith/shutdown
> of node ha1 was not expected
> Mar 01 13:23:36 ha2 pacemaker-schedulerd[2036]:  notice:  * Promote
> p_drbd0:0        (   Slave -> Master ha2 )
> Mar 01 13:23:36 ha2 pacemaker-schedulerd[2036]:  notice:  * Start
> p_fs_clust01     (                   ha2 )
> Mar 01 13:23:36 ha2 pacemaker-schedulerd[2036]:  notice:  * Start
> p_mysql_001      (                   ha2 )
> 
> 
> So it is pretty fast to react (8 seconds)
> 
> ha1:
> 
> Mar 01 13:23:27 ha1 corosync[1552]:   [TOTEM ] A processor failed,
> forming new configuration.
> Mar 01 13:23:30 ha1 corosync[1552]:   [VOTEQ ] waiting for quorum device
> Qdevice poll (but maximum for 30000 ms)
> Mar 01 13:23:30 ha1 corosync[1552]:   [TOTEM ] A new membership
> (192.168.1.1:3640) was formed. Members left: 2
> Mar 01 13:23:30 ha1 corosync[1552]:   [TOTEM ] Failed to receive the
> leave message. failed: 2
> Mar 01 13:23:30 ha1 corosync[1552]:   [CPG   ] downlist left_list: 1
> received
> Mar 01 13:23:30 ha1 pacemaker-attrd[1738]:  notice: Node ha2 state is
> now lost
> Mar 01 13:23:30 ha1 pacemaker-attrd[1738]:  notice: Removing all ha2
> attributes for peer loss
> Mar 01 13:23:30 ha1 pacemaker-attrd[1738]:  notice: Purged 1 peer with
> id=2 and/or uname=ha2 from the membership cache
> Mar 01 13:23:30 ha1 pacemaker-based[1735]:  notice: Node ha2 state is
> now lost
> Mar 01 13:23:30 ha1 pacemaker-based[1735]:  notice: Purged 1 peer with
> id=2 and/or uname=ha2 from the membership cache
> Mar 01 13:23:30 ha1 pacemaker-controld[1740]:  notice: Our peer on the
> DC (ha2) is dead
> Mar 01 13:23:30 ha1 pacemaker-controld[1740]:  notice: State transition
> S_NOT_DC -> S_ELECTION
> Mar 01 13:23:30 ha1 pacemaker-fenced[1736]:  notice: Node ha2 state is
> now lost
> Mar 01 13:23:30 ha1 pacemaker-fenced[1736]:  notice: Purged 1 peer with
> id=2 and/or uname=ha2 from the membership cache
> Mar 01 13:23:32 ha1 corosync[1552]:   [VOTEQ ] waiting for quorum device
> Qdevice poll (but maximum for 30000 ms)
> Mar 01 13:23:32 ha1 corosync[1552]:   [TOTEM ] A new membership
> (192.168.1.1:3644) was formed. Members
> Mar 01 13:23:32 ha1 corosync[1552]:   [CPG   ] downlist left_list: 0
> received
> Mar 01 13:23:33 ha1 corosync[1552]:   [VOTEQ ] waiting for quorum device
> Qdevice poll (but maximum for 30000 ms)
> Mar 01 13:23:33 ha1 corosync[1552]:   [TOTEM ] A new membership
> (192.168.1.1:3648) was formed. Members
> Mar 01 13:23:33 ha1 corosync[1552]:   [CPG   ] downlist left_list: 0
> received
> Mar 01 13:23:35 ha1 corosync[1552]:   [VOTEQ ] waiting for quorum device
> Qdevice poll (but maximum for 30000 ms)
> ...
> Mar 01 13:24:05 ha1 corosync-qdevice[1563]: Can't connect to qnetd host.
> (-5986): Network address not available (in use?)
> Mar 01 13:24:05 ha1 corosync-qdevice[1563]: Mar 01 13:24:05 error
> Can't connect to qnetd host. (-5986): Network address not available (in
> use?)
> Mar 01 13:24:05 ha1 corosync[1552]:   [VOTEQ ] waiting for quorum device
> Qdevice poll (but maximum for 30000 ms)

Yes, this is strange and shouldn't really happen. There should be just
one "waiting for quorum device Qdevice poll" message.

> Mar 01 13:24:05 ha1 corosync[1552]:   [TOTEM ] A new membership
> (192.168.1.1:3736) was formed. Members
> Mar 01 13:24:05 ha1 corosync[1552]:   [CPG   ] downlist left_list: 0
> received
> Mar 01 13:24:05 ha1 corosync[1552]:   [QUORUM] This node is within the
> non-primary component and will NOT provide any services.
> Mar 01 13:24:05 ha1 corosync[1552]:   [QUORUM] Members[1]: 1
> Mar 01 13:24:05 ha1 corosync[1552]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> Mar 01 13:24:05 ha1 pacemaker-controld[1740]:  warning: Quorum lost
> Mar 01 13:24:05 ha1 pacemaker-controld[1740]:  notice: Node ha2 state is
> now lost
> Mar 01 13:24:05 ha1 pacemaker-controld[1740]:  notice: State transition
> S_ELECTION -> S_INTEGRATION
> Mar 01 13:24:05 ha1 pacemaker-controld[1740]:  notice: Updating quorum
> status to false (call=56)
> Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]:  warning: Blind faith:
> not fencing unseen nodes
> Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]:  warning: Fencing and
> resource management disabled due to lack of quorum
> Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]:  notice:  * Stop
> p_drbd0:0     ( Master ha1 )   due to no quorum
> Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]:  notice:  * Stop
> p_drbd1:0     (  Slave ha1 )   due to no quorum
> Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]:  notice:  * Stop
> p_fs_clust01     (        ha1 )   due to no quorum
> Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]:  notice:  * Start
> p_fs_clust02     (        ha1 )   due to no quorum (blocked)
> Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]:  notice:  * Stop
> p_mysql_001      (        ha1 )   due to no quorum
> Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]:  notice:  * Start
> p_mysql_006      (        ha1 )   due to no quorum (blocked)
> Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]:  notice:  * Start
> p_mysql_666      (        ha1 )   due to no quorum (blocked)
> Mar 01 13:24:05 ha1 pacemaker-controld[1740]:  notice: Processing graph
> 0 (ref=pe_calc-dc-16145
> 
> 
> So it took it almost 40 seconds to make decision. Somehow it is exactly
> the opposite of what I observed before - disconnected node was fast and
> connected node was slow.
> 
> While I can understand why behavior changed for connected node, I still
> do not understand why disconnected node now needs so much time.
> 
> And what is worse, this is not reliable - next time I test both nodes
> react almost immediately (just 3 seconds to reach decision it is out of
> quorum for disconnected node). That is the most irritating, as one
> expects consistent behavior here.

That's a bit harder to explain but it has a reason.

Qnetd requires qdevice to send some message at least every
dpd_interval_coefficient * heartbeat_interval (by default 1.5 * 8000
ms). If nothing is received within that timeframe, the client is
disconnected by qnetd.

Qdevice sends a heartbeat (by default) every 8000 ms. It also expects
a reply from qnetd before the next heartbeat is sent; if there is no
reply, it disconnects from qnetd.
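
A minimal sketch in Python of those two timeouts, using only the
defaults quoted above (the constant names are mine, purely
illustrative, not actual corosync/qnetd option names):

    # Defaults described above; names here are illustrative only.
    HEARTBEAT_INTERVAL_MS = 8000    # qdevice sends a heartbeat this often
    DPD_INTERVAL_COEFFICIENT = 1.5  # qnetd dead-peer-detection multiplier

    # qnetd side: silence longer than this and the client is dropped
    qnetd_dpd_timeout_ms = DPD_INTERVAL_COEFFICIENT * HEARTBEAT_INTERVAL_MS

    # qdevice side: it expects qnetd's reply before the next heartbeat
    # goes out, i.e. within one heartbeat interval, otherwise it
    # disconnects itself
    qdevice_reply_timeout_ms = HEARTBEAT_INTERVAL_MS

    print(qnetd_dpd_timeout_ms)      # 12000.0 ms
    print(qdevice_reply_timeout_ms)  # 8000 ms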

So a typical situation is:
- node 1, node 2 and qnetd are running
- node 1 dies
- corosync on node 2 finds out node 1 is missing and waits for qdevice
- qdevice on node 2 sends the new information to qnetd
- qnetd finds out that the cluster information provided by node 1 and
node 2 diverges, so it waits until node 1 sends updated information
- qdevice on node 2 is waiting for qnetd's reply
- qnetd's timeout for node 1 expires, so node 1 is disconnected. That
means the membership information of the cluster no longer diverges
(there is only one node left), so qdevice on node 2 is informed about
the new decision (roughly modelled in the sketch below)
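
A minimal, illustrative-only Python sketch of qnetd's side of that
wait (none of these names are real qnetd internals; the logic just
mirrors the steps above):

    import time

    def qnetd_wait_for_agreement(node2_membership, get_node1_membership,
                                 node1_dpd_deadline):
        """Answer node 2 once the reported cluster views stop diverging.

        get_node1_membership() returns node 1's last reported membership,
        or None once node 1 has been disconnected; node1_dpd_deadline is
        the time.monotonic() value at which qnetd gives up on node 1.
        """
        while True:
            node1_membership = get_node1_membership()
            if node1_membership is None or node1_membership == node2_membership:
                return "ACK"    # views agree -> node 2 gets its vote
            if time.monotonic() >= node1_dpd_deadline:
                return "ACK"    # DPD fired: node 1 dropped, only node 2 left
            time.sleep(0.1)     # keep waiting for node 1's update

With node 1 gone, get_node1_membership() keeps returning the stale
pre-split view, so the loop only exits when the DPD deadline passes -
which is the delay node 2 observes before it is given its vote.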

The "randomness" of reaction time has two reasons. Reason 1 - 
qnetd-qdevice heartbeat are not connected with corosync one and they are 
running in parallel. Reason 2 is just a matter of when node 1 sent its 
last message to qnetd. If last message was sent right before it was 
shutdown then qnetd waits for (almost) full 12 sec, but if it was closer 
to the timeout expiration (so let's say heartbeat was sent 7 sec ago) 
then qnetd needs to wait for much shorter time (in this example it would 
be really just 5 sec). Combined with Reason 1 there is +-3 sec before 
corosync says "waiting for quorum device Qdevice poll ..." and qdevice 
providing the vote.
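
A rough back-of-the-envelope version of Reason 2, again assuming only
the defaults above (just a sketch, not real code):

    HEARTBEAT_MS = 8000
    DPD_TIMEOUT_MS = 1.5 * HEARTBEAT_MS   # 12000 ms

    def qnetd_extra_wait_ms(ms_since_node1_last_heartbeat):
        """How long qnetd still waits for node 1 after node 1 vanishes."""
        return max(0, DPD_TIMEOUT_MS - ms_since_node1_last_heartbeat)

    print(qnetd_extra_wait_ms(0))     # heartbeat right before shutdown -> 12000 ms
    print(qnetd_extra_wait_ms(7000))  # heartbeat sent 7 sec earlier    ->  5000 ms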


> 
> That is something inside corosync/corosync-qdevice. At least it seems to
> improve situation with qnetd response timing to survived nodes.
> 

No matter what, are you able to provide a step-by-step reproducer of
that 40 sec delay? Because it's definitely wrong. The worst case
should be 16 sec.

Regards,
   Honza


