[ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

Eric Robinson eric.robinson at psmnv.com
Sat Feb 27 01:05:06 EST 2021


> -----Original Message-----
> From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
> Borzenkov
> Sent: Friday, February 26, 2021 1:25 PM
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
> Down Anyway?
>
> On 26.02.2021 21:58, Eric Robinson wrote:
> >> -----Original Message-----
> >> From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
> >> Borzenkov
> >> Sent: Friday, February 26, 2021 11:27 AM
> >> To: users at clusterlabs.org
> >> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice
> >> Went Down Anyway?
> >>
> >> 26.02.2021 19:19, Eric Robinson пишет:
> >>> At 5:16 am Pacific time Monday, one of our cluster nodes failed and its
> >>> mysql services went down. The cluster did not automatically recover.
> >>>
> >>> We're trying to figure out:
> >>>
> >>>
> >>>   1.  Why did it fail?
> >>
> >> Pacemaker only registered loss of connection between two nodes. You
> >> need to investigate why it happened.
> >>
> >>>   2.  Why did it not automatically recover?
> >>>
> >>> The cluster did not recover until we manually executed...
> >>>
> >>
> >> *Cluster* never failed in the first place. Specific resource may. Do
> >> not confuse things more than is necessary.
> >>
> >>> # pcs resource cleanup p_mysql_622
> >>>
> >>
> >> Because this resource failed to stop and this is fatal.
> >>
> >>> Feb 22 05:16:30 [91682] 001db01a    pengine:   notice: LogAction:        * Stop     p_mysql_622      (                 001db01a )   due to no quorum
> >>
> >> Remaining node lost quorum and decided to stop resources
> >>
> >
> > I consider this a cluster failure, exacerbated by a resource failure. We can
> > investigate why resource p_mysql_622 failed to stop, but it seems the
> > underlying problem is the loss of quorum.
>
> This problem is outside of pacemaker scope. You are shooting the messenger
> here.
>

I appreciate your patience here. Here is my confusion. We have three devices--two database servers and a qdevice. Unless two devices lost connection with the network at the same time, the cluster should not have lost quorum. If node 001db01a lost all connectivity (and therefore quorum), then I understand that the default Pacemaker action would be to stop its services. However, that does not explain why node 001db01b did not take over and start the services, as it would still have had quorum.
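
For reference, the quorum side of our corosync.conf is (as far as I can tell) the standard qdevice arrangement, roughly the sketch below. The qnetd hostname is illustrative, not copied from the real config; with the ffsplit algorithm the qdevice adds one vote, so either database node plus the qdevice should hold quorum at 2 of 3 votes:

quorum {
    provider: corosync_votequorum
    device {
        votes: 1
        model: net
        net {
            host: qnetd.example.local
            algorithm: ffsplit
        }
    }
}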

> > That should not have happened with the qdevice in the mix, should it?
> >
>
> Huh? It is up to you to provide infrastructure where qdevice connection
> never fails. Again this is outside of pacemaker scope.
>

Does something in the logs indicate that BOTH database nodes lost quorum? Are you suggesting that Azure's network went down and all the devices lost communication with each other, and that's why quorum was lost?
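
To rule that out, I'm going to pull the quorum view from both database nodes with the standard status commands (nothing site-specific here):

# corosync-quorumtool -s
# corosync-qdevice-tool -s
# pcs quorum status

If the corosync logs on 001db01b show that it also lost its connection to the qnetd host around 05:16, that would at least explain why 001db01b did not keep quorum either.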

> > I'm confused about what is supposed to happen here. If the root cause is
> > that node 001db01a briefly lost all communication with the network (just
> > guessing), then it should have taken no action, including STONITH, since
> > there would be no quorum.
>
> Read pacemaker documentation. Default action when node goes out of
> quorum is to stop all resources.
>
> > (There is no physical STONITH device anyway, as both nodes are in Azure.)
> > Meanwhile, node 001db01b would still have had quorum (itself plus the
> > qdevice), and should have assumed ownership of the resources and started
> > them, or no?
>
> I commented on this in another mail. pacemaker documentation does not
> really describe what happens, and blindly restarting all resources locally
> would easily lead to data corruption.
>
> Having STONITH would solve your problem. 001db01b would have fenced
> 001db01a and restarted all resources.
>
> Without STONITH it is not possible in general to handle split brain and
> resource stop failures. You do not know what is left active and what not so it
> is not safe to attempt to restart resources elsewhere.

The nodes are using DRBD. Since DRBD has its own split-brain detection, I don't think there is the same data-corruption concern as with shared storage. In the scenario where node 001db01a loses connectivity, 001db01b still has quorum thanks to the qdevice's vote, so it should promote DRBD and start the mysql services. If 001db01a subsequently comes back online, both DRBD resources go StandAlone and the services go back down, but there is no corruption. You then do a manual split-brain recovery (discarding the data on 001db01a) and you're back up.
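
For what it's worth, the manual recovery we would do in that case is the usual DRBD split-brain procedure, roughly as below (r0 stands in for our actual DRBD resource name):

On 001db01a, the side whose changes get discarded:
# drbdadm disconnect r0
# drbdadm secondary r0
# drbdadm connect --discard-my-data r0

On 001db01b, the survivor, if its connection state is also StandAlone:
# drbdadm connect r0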

I don't see how STONITH makes things stable in this scenario. If all the nodes lose quorum, would they take STONITH action? If so, which node would win? I'm wary of enabling STONITH because, until we understand why the nodes lost quorum in the first place, don't we run the risk of random unwanted STONITH events?
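
If we did end up enabling fencing, I assume it would look something like the sketch below using the Azure ARM fence agent. I'm writing the parameter names from memory, so they would need to be checked against pcs stonith describe fence_azure_arm, and all of the Azure IDs and VM names are placeholders:

# pcs stonith create azure_fence fence_azure_arm \
      username="<app-id>" password="<client-secret>" \
      tenantId="<tenant-id>" subscriptionId="<subscription-id>" \
      resourceGroup="<resource-group>" \
      pcmk_host_map="001db01a:<vm-name-a>;001db01b:<vm-name-b>"
# pcs property set stonith-enabled=true

But my question still stands: until we know why quorum was lost, I would rather understand the failure than add a fencing device that might fire during the same kind of network blip.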
