[ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

Sat Feb 27 01:55:06 EST 2021

On 27.02.2021 09:05, Eric Robinson wrote:
>> -----Original Message-----
>> From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
>> Borzenkov
>> Sent: Friday, February 26, 2021 1:25 PM
>> To: users at clusterlabs.org
>> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
>> Down Anyway?
>>
>> On 26.02.2021 21:58, Eric Robinson wrote:
>>>> -----Original Message-----
>>>> From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
>>>> Borzenkov
>>>> Sent: Friday, February 26, 2021 11:27 AM
>>>> To: users at clusterlabs.org
>>>> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice
>>>> Went Down Anyway?
>>>>
>>>> On 26.02.2021 19:19, Eric Robinson wrote:
>>>>> At 5:16 am Pacific time Monday, one of our cluster nodes failed and its
>>>>> mysql services went down. The cluster did not automatically recover.
>>>>>
>>>>> We're trying to figure out:
>>>>>
>>>>>
>>>>>   1.  Why did it fail?
>>>>
>>>> Pacemaker only registered loss of connection between two nodes. You
>>>> need to investigate why it happened.
>>>>
>>>>>   2.  Why did it not automatically recover?
>>>>>
>>>>> The cluster did not recover until we manually executed...
>>>>>
>>>>
>>>> The *cluster* never failed in the first place. A specific resource may
>>>> have. Do not confuse things more than is necessary.
>>>>
>>>>> # pcs resource cleanup p_mysql_622
>>>>>
>>>>
>>>> Because this resource failed to stop and this is fatal.
>>>>
>>>>> Feb 22 05:16:30 [91682] 001db01a    pengine:   notice: LogAction:        * Stop       p_mysql_622      (                 001db01a )   due to no quorum
>>>>
>>>> The remaining node lost quorum and decided to stop resources.
>>>>
>>>
>>> I consider this a cluster failure, exacerbated by a resource failure. We can
>>> investigate why resource p_mysql_622 failed to stop, but it seems the
>>> underlying problem is the loss of quorum.
>>
>> This problem is outside of pacemaker scope. You are shooting the messenger
>> here.
>>
> 
> I appreciate your patience here. Here is my confusion. We have three devices--two database servers and a qdevice. Unless two devices lost connection with the network at the same time, the cluster should not have lost quorum. 

No, you misunderstand how qdevice works. qdevice is not a passive witness:
when the cluster is split into multiple partitions, qdevice decides which
partition should remain active and provides its votes to that partition so
it remains quorate. All other partitions go out of quorum.

So even if only the connection between the two nodes was lost and both
nodes retained their connection to the qnetd server, one node is expected
to go out of quorum.
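
To make the arithmetic concrete, here is a rough Python sketch of the vote
counting described above. It is only an illustration, not corosync code. It
assumes one vote per node and one qdevice vote, and it takes the partition
chosen by qdevice as an input instead of modelling how qnetd actually
decides. The function names are made up for this example.

```python
from typing import Dict, FrozenSet, List, Optional

Partition = FrozenSet[str]


def quorum_threshold(expected_votes: int) -> int:
    """Strict majority of the expected votes."""
    return expected_votes // 2 + 1


def partition_quorum(partitions: List[Partition],
                     qdevice_choice: Optional[Partition],
                     qdevice_votes: int = 1) -> Dict[Partition, bool]:
    """Return, for each partition, whether it is quorate.

    Assumption: every node carries one vote, and the qdevice vote goes
    only to the partition passed in as qdevice_choice (None means no
    partition can reach qnetd).
    """
    total_nodes = sum(len(p) for p in partitions)
    needed = quorum_threshold(total_nodes + qdevice_votes)
    result = {}
    for p in partitions:
        votes = len(p)
        if qdevice_choice is not None and p == qdevice_choice:
            votes += qdevice_votes
        result[p] = votes >= needed
    return result


# The link between the two nodes is lost; both may still reach qnetd.
# Here we assume qdevice sides with 001db01b.
node_a = frozenset({"001db01a"})
node_b = frozenset({"001db01b"})
status = partition_quorum([node_a, node_b], qdevice_choice=node_b)
for part, quorate in status.items():
    print(sorted(part), "quorate" if quorate else "NOT quorate")
```

With three expected votes the threshold is two, so only the partition that
receives the qdevice vote stays quorate. The other node goes out of quorum
even though its own connection to qnetd may be fine.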

> If node 001db01a lost all connectivity (and therefore quorum), then I understand that the default Pacemaker action would be to stop its services. However, that does not explain why node 001db01b did not take over and start the services, as it would still have had quorum.

You really need to show your corosync and pacemaker configuration.

> 
>>> That should not have happened with the qdevice in the mix, should it?
>>>
>>
>> Huh? It is up to you to provide infrastructure where qdevice connection
>> never fails. Again this is outside of pacemaker scope.
>>
> 
> Does something in the logs indicate that BOTH database nodes lost quorum?

Not that I can see; second node apparently remained in quorum.

> Are you suggesting that Azure's network went down and all the devices lost communication with each other, and that's why quorum was lost?

Communication between the two pacemaker nodes was definitely lost, at least
from pacemaker's point of view. Communication with the qnetd server may have
been lost too, but that does not change the end result - one node was
expected to go out of quorum.

> 
>>> I'm confused about what is supposed to happen here. If the root cause is
>>> that node 001db01a briefly lost all communication with the network (just
>>> guessing), then it should have taken no action, including STONITH, since
>>> there would be no quorum.
>>
>> Read pacemaker documentation. Default action when node goes out of
>> quorum is to stop all resources.
>>
>>> (There is no physical STONITH device anyway, as both nodes are in Azure.)
>>> Meanwhile, node 001db01b would still have had quorum (itself plus the
>>> qdevice), and should have assumed ownership of the resources and started
>>> them, or no?
>>
>> I commented on this in another mail. pacemaker documentation does not
>> really describe what happens, and blindly restarting all resources locally
>> would easily lead to data corruption.
>>
>> Having STONITH would solve your problem. 001db01b would have fenced
>> 001db01a and restarted all resources.
>>
>> Without STONITH it is not possible in general to handle split brain and
>> resource stop failures. You do not know what is left active and what not so it
>> is not safe to attempt to restart resources elsewhere.
> 
> The nodes are using DRBD. Since that has its own split-brain detection, I don't think there is a concern about data corruption as there would be with shared storage.

To the best of my knowledge, you need fencing to *resolve* a DRBD split
brain. But I have no first-hand experience with DRBD, so I cannot comment here.

> In a scenario where node 001db01a loses connectivity, 001db01b still has quorum because of the vote from the qdevice. It should promote DRBD and start the mysql services. If 001db01a subsequently comes back online, then both DRBD devices go into standalone and the services go back down, but there's no corruption. You then do a manual split-brain recovery (discard data on 001db01a) and you're back up.
> 
> I don't see how STONITH makes things stable in this scenario. If all the nodes lose quorum, would they take STONITH action? If so, which node would win? I'm worried about enabling STONITH because unless we understand why the nodes lost quorum, don't we run the risk of random unwanted STONITH events?

Quorum is not a replacement for fencing. Actually, an HA cluster does not
need quorum at all - all it needs is fencing. All of the two-node
heartbeat/pacemaker clusters I have used over the past decade had
no-quorum-policy=ignore, and the corosync two_node option does exactly
the same - it *fakes* quorum just to satisfy the default pacemaker
no-quorum-policy value.
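
As an illustration of what "faking" quorum means, here is a small Python
sketch. It is a simplification under stated assumptions, not the votequorum
implementation: it models only the strict-majority rule and the two_node
special case (other options such as wait_for_all are ignored), and it covers
just the two no-quorum-policy values mentioned in this thread. The function
names are invented for the example.

```python
def is_quorate(votes_present: int, expected_votes: int,
               two_node: bool = False) -> bool:
    """Strict majority, unless two_node lowers the requirement to one vote."""
    if two_node and expected_votes == 2:
        return votes_present >= 1
    return votes_present >= expected_votes // 2 + 1


def on_quorum_loss(no_quorum_policy: str = "stop") -> str:
    """What a non-quorate partition does with its resources.

    Only the two policies discussed here are modelled.
    """
    actions = {
        "stop": "stop all resources",          # pacemaker default
        "ignore": "keep managing resources",   # quorum loss is disregarded
    }
    return actions[no_quorum_policy]


# One surviving node out of two:
print(is_quorate(1, 2))                 # plain majority rule
print(is_quorate(1, 2, two_node=True))  # two_node keeps the survivor quorate
print(on_quorum_loss())                 # default policy
print(on_quorum_loss("ignore"))
```

Either way the surviving node keeps running resources without any proof
that its peer is really down, which is why fencing is still needed.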

Quorum provides one way to choose which node(s) should be left
running. It still does not mean it is safe to take over resources from
the other nodes. Even without shared storage, consider the trivial case
of a duplicated IP address.

What quorum makes possible is self-fencing. Nodes that go out of quorum
commit suicide, and *that* is what allows the quorate partition to assume
a "clean state" and start the takeover (and *NOT* the fact that the
remaining partition is quorate).

Most commercial HA managers I am aware of work with self-fencing and do
not even offer the possibility of using anything else. This is probably
what made the idea of quorum so deeply ingrained in people's minds -
because all the documentation you read talks about the need to have quorum
without actually explaining *why* you need to have quorum.

In the case of pacemaker, the quorate partition will initiate fencing of
the other nodes and will continue with taking over their resources only
after fencing has succeeded. Pacemaker also supports self-fencing via
SBD watchdog if no external fencing mechanism is possible.
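
The ordering can be sketched in a few lines of Python. This is a toy model
of the rule "fence first, take over afterwards", not pacemaker's scheduler.
take_over() and the fence callback are hypothetical names, and the callback
stands in for whatever fencing mechanism is configured.

```python
from typing import Callable, List


def take_over(lost_node: str, resources: List[str],
              fence: Callable[[str], bool]) -> List[str]:
    """Return the resources that may be started on the surviving side.

    `fence` is a hypothetical callback that returns True only once the
    lost node is confirmed to be powered off or reset.
    """
    if not fence(lost_node):
        # The peer's state is unknown: it may still be running the
        # resources, so starting them here risks duplicates
        # (IP address, database, ...).
        return []
    return list(resources)


# Fencing confirmed: the resource can be recovered on the survivor.
started = take_over("001db01a", ["p_mysql_622"], fence=lambda node: True)
# Fencing failed or unavailable: nothing is started.
blocked = take_over("001db01a", ["p_mysql_622"], fence=lambda node: False)
print(started, blocked)
```

Without a working fence step the second case is the only safe outcome,
which matches what you observed: the resource stayed down until someone
intervened manually.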