[ClusterLabs] Antw: [EXT] Re: Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Mar 1 02:23:29 EST 2021


>>> Eric Robinson <eric.robinson at psmnv.com> wrote on 26.02.2021 at 19:58 in
message
<SA2PR03MB58848961B4844A6BE699D73FFA9D9 at SA2PR03MB5884.namprd03.prod.outlook.com>

>>  -----Original Message-----
>> From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
>> Borzenkov
>> Sent: Friday, February 26, 2021 11:27 AM
>> To: users at clusterlabs.org 
>> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
>> Down Anyway?
>>
>> 26.02.2021 19:19, Eric Robinson wrote:
>> > At 5:16 am Pacific time Monday, one of our cluster nodes failed and its
>> > mysql services went down. The cluster did not automatically recover.
>> >
>> > We're trying to figure out:
>> >
>> >
>> >   1.  Why did it fail?
>>
>> Pacemaker only registered loss of connection between two nodes. You need
>> to investigate why it happened.
>>
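As a first step, one could check the corosync link state and the system
logs on both nodes around the time of the event; a minimal sketch, assuming
systemd and the Feb 22 05:16 timestamp from the pengine log quoted below:

# corosync-cfgtool -s
# journalctl -u corosync --since "2021-02-22 05:10" --until "2021-02-22 05:30"

corosync-cfgtool -s reports the link status as corosync currently sees it,
and the journal should show when the token was lost and when membership
changed.
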
>> >   2.  Why did it not automatically recover?
>> >
>> > The cluster did not recover until we manually executed...
>> >
>>
>> The *cluster* never failed in the first place. A specific resource may
>> have. Do not confuse things more than is necessary.
>>
>> > # pcs resource cleanup p_mysql_622
>> >
>>
>> Because this resource failed to stop, and a failed stop is fatal.
>>
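A failed stop is the one failure Pacemaker cannot recover from on its own:
normally it escalates to fencing the node, and with fencing disabled the
resource is simply blocked until an operator intervenes. A minimal sketch of
inspecting the failure before cleaning it up (resource name taken from this
thread; option spellings vary by pcs version):

# crm_mon --one-shot
# pcs resource failcount show p_mysql_622
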
>> > Feb 22 05:16:30 [91682] 001db01a    pengine:   notice: LogAction:  * Stop    p_mysql_622      ( 001db01a )   due to no quorum
>>
>> The remaining node lost quorum and decided to stop resources.
>>
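What a node does on quorum loss is governed by the no-quorum-policy cluster
property, whose default of "stop" matches the log above. A minimal sketch
for checking it (the pcs syntax may differ between versions):

# pcs property show no-quorum-policy

Setting it to "ignore" would avoid the stop, but that is unsafe without
working fencing.
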
> 
> I consider this a cluster failure, exacerbated by a resource failure. We can
> investigate why resource p_mysql_622 failed to stop, but it seems the
> underlying problem is the loss of quorum. That should not have happened with
> the qdevice in the mix, should it?
> 
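Whether 001db01b really held quorum (and whether the qdevice vote was
present) can be read from the runtime state on the surviving node; a minimal
sketch, with output formats varying by corosync version:

# corosync-quorumtool -s
# pcs quorum status
# corosync-qdevice-tool -sv

The quorumtool output shows the expected and total votes including the
qdevice, and corosync-qdevice-tool shows whether the qdevice daemon is
connected to the qnetd server at the time of the check.
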
> I'm confused about what is supposed to happen here. If the root cause is 
> that node 001db01a briefly lost all communication with the network (just 
> guessing), then it should have taken no action, including STONITH, since 
> there would be no quorum. (There is no physical STONITH device anyway, as 
> both nodes are in Azure.) Meanwhile, node 001db01b would still have had 
> quorum (itself plus the qdevice), and should have assumed ownership of the 
> resources and started them, or no?

I agree so far, but to start the resources, the cluster must make sure that
the lost node is not, or is no longer, running them. That is where STONITH
comes into play...

Without STONITH you can only guess what happens.
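
Azure has no physical fencing device, but there is the fence_azure_arm
agent, which fences through the Azure ARM API instead. A minimal sketch,
with every credential and VM name below a hypothetical placeholder:

# pcs stonith create vmfence fence_azure_arm \
    login="<application-id>" passwd="<service-principal-secret>" \
    resourceGroup="<resource-group>" tenantId="<tenant-id>" \
    subscriptionId="<subscription-id>" \
    pcmk_host_map="001db01a:<azure-vm-a>;001db01b:<azure-vm-b>"
# pcs property set stonith-enabled=true

With that in place, a stop failure or a lost node gets fenced via the API,
and the survivor can safely take over the resources.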

Regards,
Ulrich



