[ClusterLabs] Antw: [EXT] Re: Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Mar 1 02:15:59 EST 2021


>>> Eric Robinson <eric.robinson at psmnv.com> wrote on 26.02.2021 at 18:23 in
message
<SA2PR03MB58840FB93671D215133499D5FA9D9 at SA2PR03MB5884.namprd03.prod.outlook.com>

>>  -----Original Message-----
>> From: Digimer <lists at alteeve.ca>
>> Sent: Friday, February 26, 2021 10:35 AM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> <users at clusterlabs.org>; Eric Robinson <eric.robinson at psmnv.com>
>> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
>> Down Anyway?
>>
>> On 2021-02-26 11:19 a.m., Eric Robinson wrote:
>> > At 5:16 am Pacific time Monday, one of our cluster nodes failed and
>> > its mysql services went down. The cluster did not automatically recover.
>> >
>> > We're trying to figure out:
>> >
>> >  1. Why did it fail?
>> >  2. Why did it not automatically recover?
>> >
>> > The cluster did not recover until we manually executed...
>> >
>> > # pcs resource cleanup p_mysql_622
>> >
>> > OS: CentOS Linux release 7.5.1804 (Core)
>> >
>> > Cluster version:
>> >
>> > corosync.x86_64                  2.4.5-4.el7                     @base
>> > corosync-qdevice.x86_64          2.4.5-4.el7                     @base
>> > pacemaker.x86_64                 1.1.21-4.el7                    @base
>> >
>> > Two nodes: 001db01a, 001db01b
>> >
>> > The following log snippet is from node 001db01a:
>> >
>> > [root at 001db01a cluster]# grep "Feb 22 05:1[67]" corosync.log-20210223
>>
>> <snip>
>>
>> > Feb 22 05:16:30 [91682] 001db01a    pengine:  warning: cluster_status:
>> > Fencing and resource management disabled due to lack of quorum
>>
>> Seems like there was no quorum from this node's perspective, so it won't
>> do anything. What does the other node's logs say?
>>
> 
> The logs from the other node are at the bottom of the original email.
> 
>> What is the cluster configuration? Do you have stonith (fencing)
>> configured?
> 
> 2-node with a separate qdevice. No fencing.

Maybe the docs should add this: "There is always some type of fencing." It
seems you chose the "ring the admin out of bed" type of fencing, where the
admin has to fix the problems manually. Personally, I prefer another type of
fencing ;-)

> 
>> Quorum is a useful tool when things are working properly, but it doesn't
>> help when things enter an undefined / unexpected state.
>> When that happens, stonith saves you. So said another way, you must have
>> stonith for a stable cluster, quorum is optional.
>>
> 
> In this case, if fencing was enabled, which node would have fenced the 
> other? Would they have gotten into a STONITH war?

Maybe, but even if both nodes were fenced and restarted, there is a good
chance that they would come up again if the problem did not persist. If the
problem preventing the cluster from running did persist, it would not make a
big difference whether the nodes kept shooting each other or not. Actually,
we have never had a STONITH war.
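
As an aside, the usual way to avoid such a fence race in a two-node cluster is
to give one node a head start, e.g. a static delay on the device that fences
the preferred survivor. A rough sketch with pcs and IPMI-based fencing (the
device names, addresses and credentials below are invented for illustration,
not taken from this cluster):

# pcs stonith create fence-001db01a fence_ipmilan pcmk_host_list=001db01a \
      ipaddr=10.0.0.11 login=admin passwd=secret lanplus=1 pcmk_delay_base=5s
# pcs stonith create fence-001db01b fence_ipmilan pcmk_host_list=001db01b \
      ipaddr=10.0.0.12 login=admin passwd=secret lanplus=1

With the delay only on the device that shoots 001db01a, an attempt to fence
001db01a waits a few seconds, so in a clean split 001db01a fences its peer
first and survives instead of both nodes shooting each other at the same time.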


> 
> More importantly, why did the failure of resource p_mysql_622 keep the whole
> cluster from recovering? As soon as I did 'pcs resource cleanup p_mysql_622'
> all the other resources recovered, but none of them are dependent on that
> resource.

Lack of STONITH. Actually for SLES a cluster without working STONITH is
unsupported.
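
If devices like the ones sketched above were in place, the remaining steps
would roughly be (again only a sketch, to be tried in a maintenance window):

# pcs property set stonith-enabled=true
# pcs stonith fence 001db01b
# pcs status

i.e. enable fencing cluster-wide, deliberately fence one node once to prove
the device really works, and check that the cluster reacts as expected. With
working STONITH a node that stops responding gets shot, and its resources can
be recovered on the survivor instead of everything waiting for an admin.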

Regards,
Ulrich


> 
>> --
>> Digimer
>> Papers and Projects: https://alteeve.com/w/
>> "I am, somehow, less interested in the weight and convolutions of Einstein's
>> brain than in the near certainty that people of equal talent have lived and
>> died in cotton fields and sweatshops." - Stephen Jay Gould
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 




