[ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

Digimer <lists@alteeve.ca>
Fri Feb 26 11:34:36 EST 2021


On 2021-02-26 11:19 a.m., Eric Robinson wrote:
> At 5:16 am Pacific time Monday, one of our cluster nodes failed and its
> mysql services went down. The cluster did not automatically recover.
> 
> We’re trying to figure out:
> 
>  1. Why did it fail?
>  2. Why did it not automatically recover?
> 
> The cluster did not recover until we manually executed…
> 
> # pcs resource cleanup p_mysql_622
> 
> OS: CentOS Linux release 7.5.1804 (Core)
> 
> Cluster version:
> 
> corosync.x86_64                  2.4.5-4.el7                     @base
> corosync-qdevice.x86_64          2.4.5-4.el7                     @base
> pacemaker.x86_64                 1.1.21-4.el7                    @base
> 
> Two nodes: 001db01a, 001db01b
> 
> The following log snippet is from node 001db01a:
> 
> [root@001db01a cluster]# grep "Feb 22 05:1[67]" corosync.log-20210223

<snip>

> Feb 22 05:16:30 [91682] 001db01a    pengine:  warning: cluster_status:  Fencing and resource management disabled due to lack of quorum

Seems like there was no quorum from this node's perspective, so it won't
do anything. What do the other node's logs say?
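
If it helps, the same grep on the peer plus a quorum check on each node
would show both sides' view of that window (the log path here is an
assumption based on your snippet from 001db01a; adjust to wherever
corosync.conf sends the logfile):

# grep "Feb 22 05:1[67]" /var/log/cluster/corosync.log-20210223
# corosync-quorumtool -s

corosync-quorumtool -s reports whether the node currently considers
itself quorate and whether the qdevice vote is being counted.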

What is the cluster configuration? Do you have stonith (fencing)
configured? Quorum is a useful tool when things are working properly,
but it doesn't help when things enter an undefined / unexpected state.
When that happens, stonith saves you. Said another way: you must have
stonith for a stable cluster; quorum is optional.
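
To check, something like the following (pcs 0.9 syntax as shipped with
CentOS 7; the device name, agent, address, and credentials in the create
example are placeholders, not taken from your cluster):

# pcs property show stonith-enabled
# pcs stonith show

# pcs stonith create fence_001db01a fence_ipmilan \
    pcmk_host_list="001db01a" ipaddr="10.0.0.1" \
    login="admin" passwd="..." op monitor interval=60s

If "pcs stonith show" comes back empty, that is the first thing to fix.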

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
