[ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

Sat Feb 27 09:08:12 EST 2021

> -----Original Message-----
> From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
> Borzenkov
> Sent: Saturday, February 27, 2021 12:55 AM
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
> Down Anyway?
>
> On 27.02.2021 09:05, Eric Robinson wrote:
> >> -----Original Message-----
> >> From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
> >> Borzenkov
> >> Sent: Friday, February 26, 2021 1:25 PM
> >> To: users at clusterlabs.org
> >> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice
> >> Went Down Anyway?
> >>
> >> On 26.02.2021 21:58, Eric Robinson wrote:
> >>>> -----Original Message-----
> >>>> From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
> >>>> Borzenkov
> >>>> Sent: Friday, February 26, 2021 11:27 AM
> >>>> To: users at clusterlabs.org
> >>>> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate
> >>>> Qdevice Went Down Anyway?
> >>>>
> >>>> 26.02.2021 19:19, Eric Robinson wrote:
> >>>>> At 5:16 am Pacific time Monday, one of our cluster nodes failed
> >>>>> and its mysql services went down. The cluster did not automatically
> >>>>> recover.
> >>>>>
> >>>>> We're trying to figure out:
> >>>>>
> >>>>>
> >>>>>   1.  Why did it fail?
> >>>>
> >>>> Pacemaker only registered loss of connection between two nodes. You
> >>>> need to investigate why it happened.
> >>>>
> >>>>>   2.  Why did it not automatically recover?
> >>>>>
> >>>>> The cluster did not recover until we manually executed...
> >>>>>
> >>>>
> >>>> *Cluster* never failed in the first place. Specific resource may.
> >>>> Do not confuse things more than is necessary.
> >>>>
> >>>>> # pcs resource cleanup p_mysql_622
> >>>>>
> >>>>
> >>>> Because this resource failed to stop and this is fatal.
> >>>>
> >>>>> Feb 22 05:16:30 [91682] 001db01a    pengine:   notice: LogAction:        * Stop p_mysql_622      (                 001db01a )   due to no quorum
> >>>>
> >>>> Remaining node lost quorum and decided to stop resources
> >>>>
> >>>
> >>> I consider this a cluster failure, exacerbated by a resource
> >>> failure. We can investigate why resource p_mysql_622 failed to stop,
> >>> but it seems the underlying problem is the loss of quorum.
> >>
> >> This problem is outside of pacemaker scope. You are shooting the
> >> messenger here.
> >>
> >
> > I appreciate your patience here. Here is my confusion. We have three
> > devices--two database servers and a qdevice. Unless two devices lost
> > connection with the network at the same time, the cluster should not have
> > lost quorum.
>
> No, you misunderstand how qdevice works. qdevice is not passive witness
> - when cluster is split in multiple partitions, qdevice decides which partition
> should remain active and provides votes to this partition so it remains
> quorate. All other partitions will go out of quorum.
>
> So even if only connection between two nodes was lost, but both nodes
> retained connection to qnetd server, one node is expected to go out of
> quorum.

I must not have explained myself well earlier, because what you wrote is exactly how I understand it. Qdevice is an active participant that provides a vote to ensure that at least one partition remains quorate.
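
To make sure we are talking about the same thing, here is the vote arithmetic as I understand it, written as a small Python sketch. It is only a toy model, not how corosync implements it: the function names are made up for illustration, and I am assuming one vote per node, one vote for the qdevice (expected votes = 3), and that qnetd happens to pick 001db01b's partition.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Simple majority: more than half of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate(node_votes: int, qdevice_vote: int, expected_votes: int) -> bool:
    """A partition is quorate if its node votes plus any qdevice vote reach the majority."""
    return node_votes + qdevice_vote >= quorum_threshold(expected_votes)


# Assumption: two nodes with one vote each, plus one qdevice vote.
EXPECTED_VOTES = 3

# Scenario: the link between the two nodes is lost, so each node is its own partition.
partitions = {"001db01a": 1, "001db01b": 1}

# Assumption for illustration only: qnetd awards its single vote to 001db01b.
chosen_partition = "001db01b"

for name, votes in partitions.items():
    qdevice_vote = 1 if name == chosen_partition else 0
    print(name, "quorate" if is_quorate(votes, qdevice_vote, EXPECTED_VOTES) else "not quorate")
```

With those assumptions, only the partition that holds the qdevice vote reaches the threshold of 2, so exactly one node stays quorate and the other does not.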

>
> > If node 001db01a lost all connectivity (and therefore quorum), then I
> > understand that the default Pacemaker action would be to stop its services.
> > However, that does not explain why node 001db01b did not take over and
> > start the services, as it would still have had quorum.
>
> You really need to show your corosync and pacemaker configuration.
>

You can see the cluster config here: https://www.dropbox.com/s/9t5ecl2pjf9yu2o/cluster_config.txt?dl=0

> >
> >>> That should not have happened with the qdevice in the mix, should it?
> >>>
> >>
> >> Huh? It is up to you to provide infrastructure where qdevice
> >> connection never fails. Again this is outside of pacemaker scope.
> >>
> >
> > Does something in the logs indicate that BOTH database nodes lost
> > quorum?
>
> Not that I can see; second node apparently remained in quorum.
>
> > Are you suggesting that Azure's network went down and all the devices
> > lost communication with each other, and that's why quorum was lost?
>
> Communication between two pacemaker nodes was definitely lost, at least
> from pacemaker point of view. Communication to qnetd server may have
> been lost, but it does not change the end result - one node was expected to
> go out of quorum.
>

I agree, one node is expected to go out of quorum. Still the question is, why didn't 001db01b take over the services? I just remembered that 001db01b has services running on it, and those services did not stop, so it seems that 001db01b did not lose quorum. So why didn't it take over the services that were running on 001db01a?

> >
> >>> I'm confused about what is supposed to happen here. If the root
> >>> cause is that node 001db01a briefly lost all communication with the
> >>> network (just guessing), then it should have taken no action,
> >>> including STONITH, since there would be no quorum.
> >>
> >> Read pacemaker documentation. Default action when node goes out of
> >> quorum is to stop all resources.
> >>
> >>> (There is no physical STONITH device anyway, as both nodes are in
> >>> Azure.) Meanwhile, node 001db01b would still have had quorum (itself
> >>> plus the qdevice), and should have assumed ownership of the resources
> >>> and started them, or no?
> >>
> >> I commented on this in another mail. pacemaker documentation does not
> >> really describe what happens, and blindly restarting all resources
> >> locally would easily lead to data corruption.
> >>
> >> Having STONITH would solve your problem. 001db01b would have fenced
> >> 001db01a and restarted all resources.
> >>
> >> Without STONITH it is not possible in general to handle split brain
> >> and resource stop failures. You do not know what is left active and
> >> what not so it is not safe to attempt to restart resources elsewhere.
> >
> > The nodes are using DRBD. Since that has its own split-brain detection, I
> > don't think there is a concern about data corruption as there would be with
> > shared storage.
>
> To my best knowledge to *resolve* DRBD split brain you need fencing. But I
> do not have first hand experience with DRBD, so cannot comment here.
>

I can help with that. We've been using DRBD for 14 years, and I can tell you from much experience that DRBD does not need STONITH to resolve split brain. When DRBD detects split brain, both nodes go into standalone mode and replication stops. The administrator must then manually tell the cluster which node should go forward as the master.

> > In a scenario where node 001db01a loses connectivity, 001db01b still has
> > quorum because of the vote from the qdevice. It should promote DRBD and
> > start the mysql services. If 001db01a subsequently comes back online, then
> > both DRBD devices go into standalone and the services go back down, but
> > there's no corruption. You then do a manual split-brain recovery (discard
> > data on 001db01a) and you're back up.
> >
> > I don't see how STONITH makes things stable in this scenario. If all the
> > nodes lose quorum, would they take STONITH action? If so, which node
> > would win? I'm worried about enabling STONITH because unless we
> > understand why the nodes lost quorum, don't we run the risk of random
> > unwanted STONITH events?
>
> Quorum is not replacement for fencing. Actually HA cluster does not need
> quorum at all - all that it needs is fencing. All of two node
> heartbeat/pacemaker clusters I have been using for the past decade had
> no-quorum-policy=ignore and corosync two_node option also does exactly
> that - it *fakes* quorum just to please default pacemaker
> no-quorum-policy value.
>

That's also how I ran my 2-node clusters for the past 14 years (starting with heartbeat, then corosync+pacemaker). However, everyone always pointed out that a 2-node cluster isn't really a cluster at all since there is no way to provide real quorum. That's why I was excited to deploy qdevice.

> Quorum provides one possibility to choose which node(s) should be left
> running. It still does not mean it is safe to take over resources from remaining
> nodes. Even without shared storage, consider trivial case of duplicated IP
> address.
>

Duplicate IPs are another concern that doesn't exist in Azure. Cluster virtual IPs don't work in Azure because the IPs must be assigned in the Azure fabric. You can put as many IPs on a node as you want, but other servers won't be able to ping them until you assign those IPs in Azure as well.

Basically, it seems that the main concerns that STONITH is designed to prevent--mainly corruption of shared storage--are not concerns in my environment.

> What quorum makes possible is self-fencing. Nodes that go out of quorum
> commit suicide and *that* enables quorate partition to assume "clean state"
> and start takeover (and *NOT* the fact that remaining partition is quorate).
>

In a scenario where node 001db01a has lost network connectivity, there is no way for node 001db01b to know that 001db01a has fenced itself. So the question is, would 001db01b refuse to take over services from 001db01a unless it can STONITH it?

> Most commercial HA managers I am aware of work with self-fencing and do
> not even offer possibility to use anything else. This is probably what made
> quorum idea so deeply ingrained in people's brains - because every
> documentation you read goes about need to have quorum without actually
> explaining *why* you need to have quorum.
>
> In case of pacemaker quorate partition will initiate fencing of other nodes
> and only after fencing has been successful will continue with taking over their
> resources. Pacemaker also supports self-fencing via SBD watchdog if no
> external fencing mechanism is possible.

This is confusing because we have never used fencing, yet our cluster services do fail over if you pull the power plug on one of the nodes.
