[ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

Eric Robinson eric.robinson at psmnv.com
Tue Mar 2 13:30:31 EST 2021


> -----Original Message-----
> From: Users <users-bounces at clusterlabs.org> On Behalf Of Jan Friesse
> Sent: Monday, March 1, 2021 3:27 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> <users at clusterlabs.org>; Andrei Borzenkov <arvidjaar at gmail.com>
> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
> Down Anyway?
>
>
> > On 27.02.2021 22:12, Andrei Borzenkov wrote:
> >> On 27.02.2021 17:08, Eric Robinson wrote:
> >>>
> >>> I agree, one node is expected to go out of quorum. Still the question is,
> >>> why didn't 001db01b take over the services? I just remembered that
> >>> 001db01b has services running on it, and those services did not stop, so
> >>> it seems that 001db01b did not lose quorum. So why didn't it take over
> >>> the services that were running on 001db01a?
> >>
> >> That I cannot answer. I cannot reproduce it using a similar configuration.
> >
> > Hmm ... actually I can.
> >
> > Two nodes, ha1 and ha2, plus a qdevice. I blocked all communication *from*
> > ha1 (to be precise, all packets with ha1's source MAC are dropped).
> > This happened around 10:43:45. Now look:
> >
> > ha1 immediately stops all services:
> >
> > Feb 28 10:43:44 ha1 corosync[3692]:   [TOTEM ] A processor failed, forming new configuration.
> > Feb 28 10:43:47 ha1 corosync[3692]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
> > Feb 28 10:43:47 ha1 corosync[3692]:   [TOTEM ] A new membership (192.168.1.1:2944) was formed. Members left: 2
> > Feb 28 10:43:47 ha1 corosync[3692]:   [TOTEM ] Failed to receive the leave message. failed: 2
> > Feb 28 10:43:47 ha1 corosync[3692]:   [CPG   ] downlist left_list: 1 received
> > Feb 28 10:43:47 ha1 pacemaker-attrd[3703]:  notice: Node ha2 state is now lost
> > Feb 28 10:43:47 ha1 pacemaker-attrd[3703]:  notice: Removing all ha2 attributes for peer loss
> > Feb 28 10:43:47 ha1 pacemaker-attrd[3703]:  notice: Purged 1 peer with id=2 and/or uname=ha2 from the membership cache
> > Feb 28 10:43:47 ha1 pacemaker-based[3700]:  notice: Node ha2 state is now lost
> > Feb 28 10:43:47 ha1 pacemaker-based[3700]:  notice: Purged 1 peer with id=2 and/or uname=ha2 from the membership cache
> > Feb 28 10:43:47 ha1 pacemaker-controld[3705]:  warning: Stonith/shutdown of node ha2 was not expected
> > Feb 28 10:43:47 ha1 pacemaker-controld[3705]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
> > Feb 28 10:43:47 ha1 pacemaker-fenced[3701]:  notice: Node ha2 state is now lost
> > Feb 28 10:43:47 ha1 pacemaker-fenced[3701]:  notice: Purged 1 peer with id=2 and/or uname=ha2 from the membership cache
> > Feb 28 10:43:48 ha1 corosync[3692]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
> > Feb 28 10:43:48 ha1 corosync[3692]:   [TOTEM ] A new membership (192.168.1.1:2948) was formed. Members
> > Feb 28 10:43:48 ha1 corosync[3692]:   [CPG   ] downlist left_list: 0 received
> > Feb 28 10:43:50 ha1 corosync[3692]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
> > Feb 28 10:43:50 ha1 corosync[3692]:   [TOTEM ] A new membership (192.168.1.1:2952) was formed. Members
> > Feb 28 10:43:50 ha1 corosync[3692]:   [CPG   ] downlist left_list: 0 received
> > Feb 28 10:43:51 ha1 corosync[3692]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
> > Feb 28 10:43:51 ha1 corosync[3692]:   [TOTEM ] A new membership (192.168.1.1:2956) was formed. Members
> > Feb 28 10:43:51 ha1 corosync[3692]:   [CPG   ] downlist left_list: 0 received
> > Feb 28 10:43:56 ha1 corosync-qdevice[4522]: Server didn't send echo reply message on time
> > Feb 28 10:43:56 ha1 corosync-qdevice[4522]: Feb 28 10:43:56 error Server didn't send echo reply message on time
> > Feb 28 10:43:56 ha1 corosync[3692]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
> > Feb 28 10:43:56 ha1 corosync[3692]:   [QUORUM] Members[1]: 1
> > Feb 28 10:43:56 ha1 corosync[3692]:   [MAIN  ] Completed service synchronization, ready to provide service.
> > Feb 28 10:43:56 ha1 pacemaker-controld[3705]:  warning: Quorum lost
> > Feb 28 10:43:56 ha1 pacemaker-controld[3705]:  notice: Node ha2 state is now lost
> > Feb 28 10:43:56 ha1 pacemaker-controld[3705]:  warning: Stonith/shutdown of node ha2 was not expected
> > Feb 28 10:43:56 ha1 pacemaker-controld[3705]:  notice: Updating quorum status to false (call=274)
> > Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  warning: Fencing and resource management disabled due to lack of quorum
> > Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  notice:  * Stop     p_drbd0:0        (            Master ha1 )   due to no quorum
> > Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  notice:  * Stop     p_drbd1:0        (             Slave ha1 )   due to no quorum
> > Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  notice:  * Stop     p_fs_clust01     (                   ha1 )   due to no quorum
> > Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  notice:  * Start    p_fs_clust02     (                   ha1 )   due to no quorum (blocked)
> > Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  notice:  * Stop     p_mysql_001      (                   ha1 )   due to no quorum
> > Feb 28 10:43:57 ha1 pacemaker-schedulerd[3704]:  notice:  * Start    p_mysql_006      (                   ha1 )   due to no quorum (blocked)
> >
> >
> >
> > ha2 *waits for 30 seconds* before doing anything:
> >
> > Feb 28 10:43:44 ha2 corosync[5389]:   [TOTEM ] A processor failed, forming new configuration.
> > Feb 28 10:43:45 ha2 corosync[5389]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
> > Feb 28 10:43:45 ha2 corosync[5389]:   [TOTEM ] A new membership (192.168.1.2:2936) was formed. Members left: 1
> > Feb 28 10:43:45 ha2 corosync[5389]:   [TOTEM ] Failed to receive the leave message. failed: 1
> > Feb 28 10:43:45 ha2 corosync[5389]:   [CPG   ] downlist left_list: 1 received
> > Feb 28 10:43:45 ha2 pacemaker-attrd[5660]:  notice: Lost attribute writer ha1
> > Feb 28 10:43:45 ha2 pacemaker-attrd[5660]:  notice: Node ha1 state is now lost
> > Feb 28 10:43:45 ha2 pacemaker-attrd[5660]:  notice: Removing all ha1 attributes for peer loss
> > Feb 28 10:43:45 ha2 pacemaker-attrd[5660]:  notice: Purged 1 peer with id=1 and/or uname=ha1 from the membership cache
> > Feb 28 10:43:45 ha2 pacemaker-based[5657]:  notice: Node ha1 state is now lost
> > Feb 28 10:43:45 ha2 pacemaker-based[5657]:  notice: Purged 1 peer with id=1 and/or uname=ha1 from the membership cache
> > Feb 28 10:43:45 ha2 pacemaker-controld[5662]:  notice: Our peer on the DC (ha1) is dead
> > Feb 28 10:43:45 ha2 pacemaker-controld[5662]:  notice: State transition S_NOT_DC -> S_ELECTION
> > Feb 28 10:43:45 ha2 pacemaker-fenced[5658]:  notice: Node ha1 state is now lost
> > Feb 28 10:43:45 ha2 pacemaker-fenced[5658]:  notice: Purged 1 peer with id=1 and/or uname=ha1 from the membership cache
> > Feb 28 10:44:15 ha2 corosync[5389]:   [VOTEQ ] lost contact with quorum device Qdevice
> > Feb 28 10:44:15 ha2 corosync[5389]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
> > Feb 28 10:44:15 ha2 corosync[5389]:   [QUORUM] Members[1]: 2
> > Feb 28 10:44:15 ha2 corosync[5389]:   [MAIN  ] Completed service synchronization, ready to provide service.
> >
> >
> > Now I recognize it, and I believe we have seen variants of this already.
> > The key is:
> >
> > corosync[5389]:   [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
> >
> > ha1 lost its connection to qnetd, so it gives up all hope immediately.
> > ha2 retains its connection to qnetd, so it waits for the final decision
> > before continuing.
> >
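
(Interjecting here: if I read corosync-qdevice(8) correctly, that 30000 ms
cap comes from the qdevice sync_timeout in the quorum section of
corosync.conf. Here is a minimal sketch of where those knobs live, using
what I believe are the documented defaults; the qnetd host name and the
ffsplit algorithm are only placeholder assumptions about a typical
two-node setup, not anything confirmed in this thread:

    quorum {
        provider: corosync_votequorum
        device {
            model: net
            # how long votequorum waits for the qdevice's vote in normal
            # operation (10000 ms is the default, as far as I can tell)
            timeout: 10000
            # longer wait used around membership changes; presumably the
            # "maximum for 30000 ms" seen in the logs above
            sync_timeout: 30000
            net {
                host: qnetd.example.com   # placeholder for the real qnetd server
                algorithm: ffsplit        # assumption; lms is the other option
            }
        }
    }

Please correct me if I have the option names or defaults wrong.)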
>
> Thanks for digging into the logs. I believe Eric is hitting
> https://github.com/corosync/corosync-qdevice/issues/10 (already fixed, but it
> may take some time to get into distributions) - the issue also contains a
> workaround.
>
> Honza
>

Reading through that linked issue, it seems that quorum timeouts are tricky to get right. Over the weekend I increased my token timeout to 5000. Are there other timeouts I should adjust to make sure mismatched or misaligned values don't set up a race condition that causes weird, random failures?
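
For reference, here is roughly what my totem section looks like after the
change. Only the token line is something I actually set; the cluster name
is a placeholder and the comments reflect my reading of corosync.conf(5),
so treat everything except token as an assumption rather than my real
config:

    totem {
        version: 2
        cluster_name: mycluster   # placeholder name
        # raised from the default (1000 ms, if I recall correctly) to ride
        # out brief network blips
        token: 5000
        # left unset on purpose; corosync derives consensus as 1.2 * token
        # consensus: 6000
    }

In particular, I am wondering whether the qdevice timeout/sync_timeout
values need to be adjusted to track a larger token value, or whether the
defaults are fine once token is raised.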

> > In your case, apparently one node was completely disconnected for 15
> > seconds, then connectivity resumed. The second node was still waiting
> > for the qdevice/qnetd decision. So it appears to be working as expected.
> >
> > Note that fencing would not have been initiated before the timeout either.
> > Fencing /may/ have been initiated after the nodes re-established their
> > connection and saw that one resource failed to stop. That would have
> > automatically resolved your issue. I need to think about how to reproduce
> > a stop failure.
> >
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
Disclaimer : This email and any files transmitted with it are confidential and intended solely for intended recipients. If you are not the named addressee you should not disseminate, distribute, copy or alter this email. Any views or opinions presented in this email are solely those of the author and might not represent those of Physician Select Management. Warning: Although Physician Select Management has taken reasonable precautions to ensure no viruses are present in this email, the company cannot accept responsibility for any loss or damage arising from the use of this email or attachments.

