[ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

Andrei Borzenkov arvidjaar at gmail.com
Sun Feb 28 03:05:24 EST 2021


On 27.02.2021 22:12, Andrei Borzenkov wrote:
> On 27.02.2021 17:08, Eric Robinson wrote:
>>
>> I agree, one node is expected to go out of quorum. Still the question is, why didn't 001db01b take over the services? I just remembered that 001db01b has services running on it, and those services did not stop, so it seems that 001db01b did not lose quorum. So why didn't it take over the services that were running on 001db01a?
> 
> That I cannot answer. I cannot reproduce it using a similar configuration.

Hmm ... actually I can.

Two nodes, ha1 and ha2, plus a qdevice. I blocked all communication *from*
ha1 (to be precise, all packets with ha1's source MAC are dropped).
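The exact mechanism does not matter; on a bridge that connects the machines
it can be done with something like the rule below (52:54:00:00:00:01 only
stands in for ha1's real MAC):

ebtables -A FORWARD -s 52:54:00:00:00:01 -j DROP    # drop every frame ha1 sends
ebtables -D FORWARD -s 52:54:00:00:00:01 -j DROP    # delete the rule again to restore connectivity

This happened around 10:43:45. Now look: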

ha1 immediately stops all services:

Feb 28 10:43:44 ha1 corosync[3692]:   [TOTEM ] A processor failed, forming new configuration.
Feb 28 10:43:47 ha1 corosync[3692]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Feb 28 10:43:47 ha1 corosync[3692]:   [TOTEM ] A new membership (192.168.1.1:2944) was formed. Members left: 2
Feb 28 10:43:47 ha1 corosync[3692]:   [TOTEM ] Failed to receive the leave message. failed: 2
Feb 28 10:43:47 ha1 corosync[3692]:   [CPG   ] downlist left_list: 1 received
Feb 28 10:43:47 ha1 pacemaker-attrd[3703]:  notice: Node ha2 state is now lost
Feb 28 10:43:47 ha1 pacemaker-attrd[3703]:  notice: Removing all ha2 attributes for peer loss
Feb 28 10:43:47 ha1 pacemaker-attrd[3703]:  notice: Purged 1 peer with id=2 and/or uname=ha2 from the membership cache
Feb 28 10:43:47 ha1 pacemaker-based[3700]:  notice: Node ha2 state is now lost
Feb 28 10:43:47 ha1 pacemaker-based[3700]:  notice: Purged 1 peer with id=2 and/or uname=ha2 from the membership cache
Feb 28 10:43:47 ha1 pacemaker-controld[3705]:  warning: Stonith/shutdown of node ha2 was not expected
Feb 28 10:43:47 ha1 pacemaker-controld[3705]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
Feb 28 10:43:47 ha1 pacemaker-fenced[3701]:  notice: Node ha2 state is now lost
Feb 28 10:43:47 ha1 pacemaker-fenced[3701]:  notice: Purged 1 peer with id=2 and/or uname=ha2 from the membership cache
Feb 28 10:43:48 ha1 corosync[3692]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Feb 28 10:43:48 ha1 corosync[3692]:   [TOTEM ] A new membership (192.168.1.1:2948) was formed. Members
Feb 28 10:43:48 ha1 corosync[3692]:   [CPG   ] downlist left_list: 0 received
Feb 28 10:43:50 ha1 corosync[3692]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Feb 28 10:43:50 ha1 corosync[3692]:   [TOTEM ] A new membership (192.168.1.1:2952) was formed. Members
Feb 28 10:43:50 ha1 corosync[3692]:   [CPG   ] downlist left_list: 0 received
Feb 28 10:43:51 ha1 corosync[3692]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Feb 28 10:43:51 ha1 corosync[3692]:   [TOTEM ] A new membership (192.168.1.1:2956) was formed. Members
Feb 28 10:43:51 ha1 corosync[3692]:   [CPG   ] downlist left_list: 0 received
Feb 28 10:43:56 ha1 corosync-qdevice[4522]: Server didn't send echo reply message on time
Feb 28 10:43:56 ha1 corosync-qdevice[4522]: Feb 28 10:43:56 error Server didn't send echo reply message on time
Feb 28 10:43:56 ha1 corosync[3692]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 28 10:43:56 ha1 corosync[3692]:   [QUORUM] Members[1]: 1
Feb 28 10:43:56 ha1 corosync[3692]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 28 10:43:56 ha1 pacemaker-controld[3705]:  warning: Quorum lost
Feb 28 10:43:56 ha1 pacemaker-controld[3705]:  notice: Node ha2 state is now lost
Feb 28 10:43:56 ha1 pacemaker-controld[3705]:  warning: Stonith/shutdown of node ha2 was not expected
Feb 28 10:43:56 ha1 pacemaker-controld[3705]:  notice: Updating quorum status to false (call=274)
Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  warning: Fencing and resource management disabled due to lack of quorum
Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  notice:  * Stop   p_drbd0:0        (            Master ha1 )   due to no quorum
Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  notice:  * Stop   p_drbd1:0        (             Slave ha1 )   due to no quorum
Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  notice:  * Stop   p_fs_clust01     (                   ha1 )   due to no quorum
Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  notice:  * Start  p_fs_clust02     (                   ha1 )   due to no quorum (blocked)
Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  notice:  * Stop   p_mysql_001      (                   ha1 )   due to no quorum
Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  notice:  * Start  p_mysql_006      (                   ha1 )   due to no quorum (blocked)



ha2 *waits for 30 seconds* before doing anything:

Feb 28 10:43:44 ha2 corosync[5389]:   [TOTEM ] A processor failed, forming new configuration.
Feb 28 10:43:45 ha2 corosync[5389]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Feb 28 10:43:45 ha2 corosync[5389]:   [TOTEM ] A new membership (192.168.1.2:2936) was formed. Members left: 1
Feb 28 10:43:45 ha2 corosync[5389]:   [TOTEM ] Failed to receive the leave message. failed: 1
Feb 28 10:43:45 ha2 corosync[5389]:   [CPG   ] downlist left_list: 1 received
Feb 28 10:43:45 ha2 pacemaker-attrd[5660]:  notice: Lost attribute writer ha1
Feb 28 10:43:45 ha2 pacemaker-attrd[5660]:  notice: Node ha1 state is now lost
Feb 28 10:43:45 ha2 pacemaker-attrd[5660]:  notice: Removing all ha1 attributes for peer loss
Feb 28 10:43:45 ha2 pacemaker-attrd[5660]:  notice: Purged 1 peer with id=1 and/or uname=ha1 from the membership cache
Feb 28 10:43:45 ha2 pacemaker-based[5657]:  notice: Node ha1 state is now lost
Feb 28 10:43:45 ha2 pacemaker-based[5657]:  notice: Purged 1 peer with id=1 and/or uname=ha1 from the membership cache
Feb 28 10:43:45 ha2 pacemaker-controld[5662]:  notice: Our peer on the DC (ha1) is dead
Feb 28 10:43:45 ha2 pacemaker-controld[5662]:  notice: State transition S_NOT_DC -> S_ELECTION
Feb 28 10:43:45 ha2 pacemaker-fenced[5658]:  notice: Node ha1 state is now lost
Feb 28 10:43:45 ha2 pacemaker-fenced[5658]:  notice: Purged 1 peer with id=1 and/or uname=ha1 from the membership cache
Feb 28 10:44:15 ha2 corosync[5389]:   [VOTEQ ] lost contact with quorum device Qdevice
Feb 28 10:44:15 ha2 corosync[5389]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 28 10:44:15 ha2 corosync[5389]:   [QUORUM] Members[1]: 2
Feb 28 10:44:15 ha2 corosync[5389]:   [MAIN  ] Completed service synchronization, ready to provide service.


Now I recognize it, and I believe we have seen variants of this already.
The key is:

corosync[5389]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)

ha1 lost its connection to qnetd, so it gives up all hope immediately. ha2
retains its connection to qnetd, so it waits for the final decision before
continuing.
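If memory serves, the 30000 ms cap is the qdevice sync_timeout, which can be
tuned in corosync.conf together with the regular heartbeat timeout. Roughly
like this; the numbers below are the defaults as far as I remember, and the
qnetd address is only a placeholder:

quorum {
    provider: corosync_votequorum
    device {
        model: net
        # normal heartbeat timeout towards qnetd, in ms (default 10000)
        timeout: 10000
        # timeout used around membership changes, in ms -- the 30000 ms in the log above
        sync_timeout: 30000
        net {
            # qnetd server; the address here is a placeholder
            host: 192.168.1.3
            algorithm: ffsplit
        }
    }
}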

In your case, apparently one node was completely disconnected for 15
seconds, then connectivity resumed. The second node was still waiting for
the qdevice/qnetd decision. So it appears to be working as expected.
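If it happens again, it might help to capture the view from both sides while
the split is in progress:

corosync-quorumtool -s      # quorum summary on a cluster node, including the Qdevice vote
corosync-qdevice-tool -s    # local qdevice daemon status on a cluster node
corosync-qnetd-tool -l      # client list as seen by the qnetd server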

Note that fencing would not have been initiated before the timeout either.
Fencing /may/ have been initiated after the nodes re-established their
connection and saw that one resource failed to stop. That would
automatically have resolved your issue. I need to think about how to
reproduce a stop failure.
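One possibility (an untested sketch, not something I have run) is a
throwaway OCF-style agent whose stop action always fails, e.g.:

#!/bin/sh
# fail-stop: minimal OCF-style agent whose stop always fails.
# Test clusters only -- a failed stop is normally answered with fencing.
case "$1" in
start)
        touch /run/fail-stop.state
        exit 0                  # OCF_SUCCESS
        ;;
monitor)
        if [ -f /run/fail-stop.state ]; then exit 0; else exit 7; fi   # OCF_SUCCESS / OCF_NOT_RUNNING
        ;;
stop)
        exit 1                  # OCF_ERR_GENERIC: the stop failure we want to reproduce
        ;;
meta-data)
        cat <<'EOF'
<?xml version="1.0"?>
<resource-agent name="fail-stop" version="0.1">
  <version>1.0</version>
  <shortdesc lang="en">Test agent whose stop always fails</shortdesc>
  <longdesc lang="en">Used only to reproduce a stop failure.</longdesc>
  <parameters/>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="10s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
        exit 0
        ;;
*)
        exit 3                  # OCF_ERR_UNIMPLEMENTED
        ;;
esac

Installed as /usr/lib/ocf/resource.d/test/fail-stop (and made executable) it
should show up as ocf:test:fail-stop, and any transition that has to stop it
should then run into the failure.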


