[ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

Andrei Borzenkov arvidjaar at gmail.com
Fri Feb 26 14:24:47 EST 2021


On 26.02.2021 21:58, Eric Robinson wrote:
>> -----Original Message-----
>> From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
>> Borzenkov
>> Sent: Friday, February 26, 2021 11:27 AM
>> To: users at clusterlabs.org
>> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
>> Down Anyway?
>>
>> 26.02.2021 19:19, Eric Robinson wrote:
>>> At 5:16 am Pacific time Monday, one of our cluster nodes failed and its mysql services went down. The cluster did not automatically recover.
>>>
>>> We're trying to figure out:
>>>
>>>
>>>   1.  Why did it fail?
>>
>> Pacemaker only registered a loss of connection between the two nodes. You
>> need to investigate why that happened.
>>
>>>   2.  Why did it not automatically recover?
>>>
>>> The cluster did not recover until we manually executed...
>>>
>>
>> The *cluster* never failed in the first place. A specific resource may
>> have. Do not confuse things more than necessary.
>>
>>> # pcs resource cleanup p_mysql_622
>>>
>>
>> Because this resource failed to stop, and that is fatal.
>>
>>> Feb 22 05:16:30 [91682] 001db01a    pengine:   notice: LogAction:        * Stop     p_mysql_622      (                 001db01a )   due to no quorum
>>
>> The remaining node lost quorum and decided to stop its resources.
>>
> 
> I consider this a cluster failure, exacerbated by a resource failure. We can investigate why resource p_mysql_622 failed to stop, but it seems the underlying problem is the loss of quorum. 

This problem is outside of Pacemaker's scope. You are shooting the
messenger here.

> That should not have happened with the qdevice in the mix, should it?
> 

Huh? It is up to you to provide infrastructure where the qdevice connection
never fails. Again, this is outside of Pacemaker's scope.
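
If you want to check the qdevice side, something along these lines should show whether each node still saw the quorum device and what corosync-qdevice logged around the outage (exact commands and output vary by pcs and corosync-qdevice version):

# pcs quorum status
# pcs quorum device status
# corosync-qdevice-tool -sv
# journalctl -u corosync-qdevice --since "2021-02-22 05:00" --until "2021-02-22 05:30"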

> I'm confused about what is supposed to happen here. If the root cause is that node 001db01a briefly lost all communication with the network (just guessing), then it should have taken no action, including STONITH, since there would be no quorum.

Read the Pacemaker documentation. The default action when a node loses
quorum is to stop all resources.
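
That behavior is controlled by the no-quorum-policy cluster property; the default is stop, and the other values (freeze, ignore, suicide) each carry their own risks without fencing. For example (pcs syntax differs slightly between versions):

# pcs property show no-quorum-policy
# pcs property set no-quorum-policy=stop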

> (There is no physical STONITH device anyway, as both nodes are in Azure.) Meanwhile, node 001db01b would still have had quorum (itself plus the qdevice), and should have assumed ownership of the resources and started them, or no?

I commented on this in another mail. The Pacemaker documentation does not
really describe what happens here, and blindly restarting all resources
locally could easily lead to data corruption.

Having STONITH would solve your problem. 001db01b would have fenced
001db01a and restarted all resources.
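
Since both nodes are Azure VMs, the usual approach is the fence_azure_arm agent. A rough sketch (the resource group, subscription, credential values and VM names below are placeholders, and parameter names differ a bit between fence-agents versions, e.g. login/passwd vs username/password):

# pcs stonith create fence_azure fence_azure_arm \
      username=<appId> password=<servicePrincipalSecret> tenantId=<tenantId> \
      subscriptionId=<subscriptionId> resourceGroup=<resourceGroup> \
      pcmk_host_map="001db01a:azure-vm-a;001db01b:azure-vm-b" \
      power_timeout=240 pcmk_reboot_timeout=900 op monitor interval=3600
# pcs property set stonith-enabled=true

Once it is in place, you can verify it with "pcs stonith fence <node>" against a node you can afford to reboot.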

Without STONITH it is not possible, in general, to handle split brain or
resource stop failures. You do not know what is still active and what is
not, so it is not safe to attempt to restart resources elsewhere.
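
That is also why the single failed stop blocked recovery until you ran the cleanup: the stop failure stays recorded, and pacemaker will not touch the resource while it does not know its state. Something like the following shows the recorded failure and clears it once you have verified the state by hand (the last command is the same cleanup you already used):

# pcs status --full
# crm_mon -1rf
# pcs resource cleanup p_mysql_622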

