[ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?
Jan Friesse
jfriesse at redhat.com
Mon Mar 1 04:26:57 EST 2021
> On 27.02.2021 22:12, Andrei Borzenkov wrote:
>> On 27.02.2021 17:08, Eric Robinson wrote:
>>>
>>> I agree, one node is expected to go out of quorum. Still, the question is: why didn't 001db01b take over the services? I just remembered that 001db01b has services running on it, and those services did not stop, so it seems that 001db01b did not lose quorum. So why didn't it take over the services that were running on 001db01a?
>>
>> That I cannot answer. I cannot reproduce it using a similar configuration.
>
> Hmm ... actually I can.
>
> Two nodes, ha1 and ha2, plus a qdevice. I blocked all communication
> *from* ha1 (to be precise, all packets with ha1's source MAC address
> are dropped). This happened around 10:43:45. Now look:
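(A side note on reproducing this kind of asymmetric failure: one way to
drop frames by source MAC - not necessarily how Andrei did it, and
assuming the cluster nodes are VMs attached to a Linux bridge on the
hypervisor - is an ebtables rule along these lines, with the MAC address
as a placeholder:

  # on the hypervisor that bridges the cluster VMs;
  # 52:54:00:aa:bb:01 stands in for ha1's NIC MAC
  ebtables -A FORWARD -s 52:54:00:aa:bb:01 -j DROP
  # delete the rule again to restore connectivity
  ebtables -D FORWARD -s 52:54:00:aa:bb:01 -j DROP

Anything that cuts ha1 off from both ha2 and the qnetd host at once
should show the same behaviour.)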
>
> ha1 immediately stops all services:
>
> Feb 28 10:43:44 ha1 corosync[3692]: [TOTEM ] A processor failed,
> forming new configuration.
> Feb 28 10:43:47 ha1 corosync[3692]: [VOTEQ ] waiting for quorum device
> Qdevice poll (but maximum for 30000 ms)
> Feb 28 10:43:47 ha1 corosync[3692]: [TOTEM ] A new membership
> (192.168.1.1:2944) was formed. Members left: 2
> Feb 28 10:43:47 ha1 corosync[3692]: [TOTEM ] Failed to receive the
> leave message. failed: 2
> Feb 28 10:43:47 ha1 corosync[3692]: [CPG ] downlist left_list: 1
> received
> Feb 28 10:43:47 ha1 pacemaker-attrd[3703]: notice: Node ha2 state is
> now lost
> Feb 28 10:43:47 ha1 pacemaker-attrd[3703]: notice: Removing all ha2
> attributes for peer loss
> Feb 28 10:43:47 ha1 pacemaker-attrd[3703]: notice: Purged 1 peer with
> id=2 and/or uname=ha2 from the membership cache
> Feb 28 10:43:47 ha1 pacemaker-based[3700]: notice: Node ha2 state is
> now lost
> Feb 28 10:43:47 ha1 pacemaker-based[3700]: notice: Purged 1 peer with
> id=2 and/or uname=ha2 from the membership cache
> Feb 28 10:43:47 ha1 pacemaker-controld[3705]: warning: Stonith/shutdown
> of node ha2 was not expected
> Feb 28 10:43:47 ha1 pacemaker-controld[3705]: notice: State transition
> S_IDLE -> S_POLICY_ENGINE
> Feb 28 10:43:47 ha1 pacemaker-fenced[3701]: notice: Node ha2 state is
> now lost
> Feb 28 10:43:47 ha1 pacemaker-fenced[3701]: notice: Purged 1 peer with
> id=2 and/or uname=ha2 from the membership cache
> Feb 28 10:43:48 ha1 corosync[3692]: [VOTEQ ] waiting for quorum device
> Qdevice poll (but maximum for 30000 ms)
> Feb 28 10:43:48 ha1 corosync[3692]: [TOTEM ] A new membership
> (192.168.1.1:2948) was formed. Members
> Feb 28 10:43:48 ha1 corosync[3692]: [CPG ] downlist left_list: 0
> received
> Feb 28 10:43:50 ha1 corosync[3692]: [VOTEQ ] waiting for quorum device
> Qdevice poll (but maximum for 30000 ms)
> Feb 28 10:43:50 ha1 corosync[3692]: [TOTEM ] A new membership
> (192.168.1.1:2952) was formed. Members
> Feb 28 10:43:50 ha1 corosync[3692]: [CPG ] downlist left_list: 0
> received
> Feb 28 10:43:51 ha1 corosync[3692]: [VOTEQ ] waiting for quorum device
> Qdevice poll (but maximum for 30000 ms)
> Feb 28 10:43:51 ha1 corosync[3692]: [TOTEM ] A new membership
> (192.168.1.1:2956) was formed. Members
> Feb 28 10:43:51 ha1 corosync[3692]: [CPG ] downlist left_list: 0
> received
> Feb 28 10:43:56 ha1 corosync-qdevice[4522]: Server didn't send echo
> reply message on time
> Feb 28 10:43:56 ha1 corosync-qdevice[4522]: Feb 28 10:43:56 error
> Server didn't send echo reply message on time
> Feb 28 10:43:56 ha1 corosync[3692]: [QUORUM] This node is within the
> non-primary component and will NOT provide any services.
> Feb 28 10:43:56 ha1 corosync[3692]: [QUORUM] Members[1]: 1
> Feb 28 10:43:56 ha1 corosync[3692]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Feb 28 10:43:56 ha1 pacemaker-controld[3705]: warning: Quorum lost
> Feb 28 10:43:56 ha1 pacemaker-controld[3705]: notice: Node ha2 state is
> now lost
> Feb 28 10:43:56 ha1 pacemaker-controld[3705]: warning: Stonith/shutdown
> of node ha2 was not expected
> Feb 28 10:43:56 ha1 pacemaker-controld[3705]: notice: Updating quorum
> status to false (call=274)
> Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]: warning: Fencing and
> resource management disabled due to lack of quorum
> Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]: notice: * Stop
> p_drbd0:0 ( Master ha1 ) due to no quorum
> Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]: notice: * Stop
> p_drbd1:0 ( Slave ha1 ) due to no quorum
> Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]: notice: * Stop
> p_fs_clust01 ( ha1 ) due to no quorum
> Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]: notice: * Start
> p_fs_clust02 ( ha1 ) due to no quorum (blocked)
> Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]: notice: * Stop
> p_mysql_001 ( ha1 ) due to no quorum
> Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]: notice: * Start
> p_mysql_006 ( ha1 ) due to no quorum (blocked)
>
>
>
> ha2 *waits for 30 seconds* before doing anything:
>
> Feb 28 10:43:44 ha2 corosync[5389]: [TOTEM ] A processor failed,
> forming new configuration.
> Feb 28 10:43:45 ha2 corosync[5389]: [VOTEQ ] waiting for quorum device
> Qdevice poll (but maximum for 30000 ms)
> Feb 28 10:43:45 ha2 corosync[5389]: [TOTEM ] A new membership
> (192.168.1.2:2936) was formed. Members left: 1
> Feb 28 10:43:45 ha2 corosync[5389]: [TOTEM ] Failed to receive the
> leave message. failed: 1
> Feb 28 10:43:45 ha2 corosync[5389]: [CPG ] downlist left_list: 1
> received
> Feb 28 10:43:45 ha2 pacemaker-attrd[5660]: notice: Lost attribute
> writer ha1
> Feb 28 10:43:45 ha2 pacemaker-attrd[5660]: notice: Node ha1 state is
> now lost
> Feb 28 10:43:45 ha2 pacemaker-attrd[5660]: notice: Removing all ha1
> attributes for peer loss
> Feb 28 10:43:45 ha2 pacemaker-attrd[5660]: notice: Purged 1 peer with
> id=1 and/or uname=ha1 from the membership cache
> Feb 28 10:43:45 ha2 pacemaker-based[5657]: notice: Node ha1 state is
> now lost
> Feb 28 10:43:45 ha2 pacemaker-based[5657]: notice: Purged 1 peer with
> id=1 and/or uname=ha1 from the membership cache
> Feb 28 10:43:45 ha2 pacemaker-controld[5662]: notice: Our peer on the
> DC (ha1) is dead
> Feb 28 10:43:45 ha2 pacemaker-controld[5662]: notice: State transition
> S_NOT_DC -> S_ELECTION
> Feb 28 10:43:45 ha2 pacemaker-fenced[5658]: notice: Node ha1 state is
> now lost
> Feb 28 10:43:45 ha2 pacemaker-fenced[5658]: notice: Purged 1 peer with
> id=1 and/or uname=ha1 from the membership cache
> Feb 28 10:44:15 ha2 corosync[5389]: [VOTEQ ] lost contact with quorum
> device Qdevice
> Feb 28 10:44:15 ha2 corosync[5389]: [QUORUM] This node is within the
> non-primary component and will NOT provide any services.
> Feb 28 10:44:15 ha2 corosync[5389]: [QUORUM] Members[1]: 2
> Feb 28 10:44:15 ha2 corosync[5389]: [MAIN ] Completed service
> synchronization, ready to provide service.
>
>
> Now I recognize it, and I believe we have seen variants of this
> already. The key is:
>
> corosync[5389]: [VOTEQ ] waiting for quorum device Qdevice poll (but
> maximum for 30000 ms)
>
> ha1 lost its connection to qnetd, so it gives up all hope immediately.
> ha2 retains its connection to qnetd, so it waits for the final decision
> before continuing.
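If I remember the defaults right, that 30000 ms is the qdevice
sync_timeout (the ordinary qdevice timeout is 10000 ms); both can be set
in the quorum.device section of corosync.conf. Roughly like this, where
qnetd-host is a placeholder and the values shown are the defaults:

  quorum {
      provider: corosync_votequorum
      device {
          model: net
          votes: 1
          # how long votequorum waits for the qdevice verdict in normal
          # operation (default 10000 ms) ...
          timeout: 10000
          # ... and during a membership change, which is the
          # "maximum for 30000 ms" seen above (default 30000 ms)
          sync_timeout: 30000
          net {
              host: qnetd-host
              algorithm: ffsplit
          }
      }
  }

Tuning these only changes how long the surviving side waits; it does not
change which side ends up without the qnetd connection.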
>
Thanks for digging into the logs. I believe Eric is hitting
https://github.com/corosync/corosync-qdevice/issues/10 (already fixed
upstream, but it may take some time for the fix to reach distributions);
the issue also contains a workaround.
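Until the fixed packages land, it is worth confirming what the qdevice
side actually sees on both nodes and on the qnetd host, for example
along these lines (the output format differs a bit between versions):

  corosync-quorumtool -s      # overall quorum state, Qdevice vote present?
  corosync-qdevice-tool -sv   # qdevice daemon state and connection to qnetd
  corosync-qnetd-tool -lv     # on the qnetd host: which clients are connected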
Honza
> In your case, apparently, one node was completely disconnected for 15
> seconds and then connectivity resumed. The second node was still
> waiting for the qdevice/qnetd decision. So it appears to be working as
> expected.
>
> Note that fencing would not have been initiated before the timeout
> either. Fencing /may/ have been initiated after the nodes established
> their connection again and saw that one resource failed to stop. That
> would automatically resolve your issue. I need to think about how to
> reproduce a stop failure.
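One way to get a reproducible stop failure - untested here, and assuming
the ocf:pacemaker:Dummy agent in your pacemaker version supports the
op_sleep parameter - is to give a throwaway resource a stop action that
cannot finish within its timeout; a timed-out stop is treated as a stop
failure and, with fencing configured, escalates to fencing:

  # stop sleeps for 30s but is only allowed 10s, so it always times out
  pcs resource create stop-fail ocf:pacemaker:Dummy op_sleep=30 \
      op stop timeout=10s

Another option is making umount of a Filesystem resource fail (e.g. a
process holding the mount point busy), though depending on force_unmount
the agent may simply kill such processes.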
>