[ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

Andrei Borzenkov arvidjaar at gmail.com
Fri Feb 26 13:35:50 EST 2021


On 26.02.2021 20:23, Eric Robinson wrote:
>> -----Original Message-----
>> From: Digimer <lists at alteeve.ca>
>> Sent: Friday, February 26, 2021 10:35 AM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> <users at clusterlabs.org>; Eric Robinson <eric.robinson at psmnv.com>
>> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
>> Down Anyway?
>>
>> On 2021-02-26 11:19 a.m., Eric Robinson wrote:
>>> At 5:16 am Pacific time Monday, one of our cluster nodes failed and
>>> its mysql services went down. The cluster did not automatically recover.
>>>
>>> We're trying to figure out:
>>>
>>>  1. Why did it fail?
>>>  2. Why did it not automatically recover?
>>>
>>> The cluster did not recover until we manually executed...
>>>
>>> # pcs resource cleanup p_mysql_622
>>>
>>> OS: CentOS Linux release 7.5.1804 (Core)
>>>
>>> Cluster version:
>>>
>>> corosync.x86_64                  2.4.5-4.el7                     @base
>>> corosync-qdevice.x86_64          2.4.5-4.el7                     @base
>>> pacemaker.x86_64                 1.1.21-4.el7                    @base
>>>
>>> Two nodes: 001db01a, 001db01b
>>>
>>> The following log snippet is from node 001db01a:
>>>
>>> [root at 001db01a cluster]# grep "Feb 22 05:1[67]" corosync.log-20210223
>>
>> <snip>
>>
>>> Feb 22 05:16:30 [91682] 001db01a    pengine:  warning: cluster_status:   Fencing and resource management disabled due to lack of quorum
>>
>> Seems like there was no quorum from this node's perspective, so it won't do
>> anything. What does the other node's logs say?
>>
> 
> The logs from the other node are at the bottom of the original email.
> 
>> What is the cluster configuration? Do you have stonith (fencing) configured?
> 
> 2-node with a separate qdevice. No fencing.
> 

I wonder what the expected behavior is in this case; the Pacemaker
documentation is rather silent about it. It explains what happens on nodes
that are out of quorum, but it is unclear whether (and when) quorate nodes
will take over resources from nodes that lost quorum. In this case 001db01b
does not seem to do anything at all for 15 seconds (while 001db01a begins
stopping resources) until 001db01a reappears:

Feb 22 05:15:56 [112947] 001db01b      attrd:     info: pcmk_cpg_membership:    Group attrd event 15: 001db01a (node 1 pid 91681) left via cluster exit
...
Feb 22 05:15:56 [112943] 001db01b pacemakerd:     info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=21): Try again (6)
...
Feb 22 05:16:11 [112947] 001db01b      attrd:     info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=424856): Try again (6)
Feb 22 05:16:11 [112945] 001db01b stonith-ng:     info: pcmk_cpg_membership:    Group stonith-ng event 16: node 1 pid 91679 joined via cluster join
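
If this happens again, it would help to capture the quorum view from both
nodes (and the qdevice) while the cluster is in that state; the output
layout varies a little between versions, but roughly:

# corosync-quorumtool -s
# corosync-qdevice-tool -s
# pcs quorum status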

BTW, the time on the two nodes seems to be about 30 seconds off.
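
Worth verifying directly on both nodes, e.g.:

# date -u; chronyc tracking

(chronyc assumes chronyd; use ntpq -p or ntpstat if the nodes still run ntpd)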

>> Quorum is a useful tool when things are working properly, but it doesn't help
>> when things enter an undefined / unexpected state.
>> When that happens, stonith saves you. So said another way, you must have
>> stonith for a stable cluster, quorum is optional.
>>
> 
> In this case, if fencing was enabled, which node would have fenced the other? Would they have gotten into a STONITH war?
> 

Looks like 001db01b retained quorum, so it would have fenced 001db01a.
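
If you do decide to add fencing, the general shape with pcs is roughly the
following; fence_ipmilan and its options here are only placeholders for
whatever out-of-band interface your nodes actually have:

# pcs stonith create fence-001db01a fence_ipmilan pcmk_host_list=001db01a ipaddr=<bmc-of-001db01a> login=<user> passwd=<secret>
# pcs stonith create fence-001db01b fence_ipmilan pcmk_host_list=001db01b ipaddr=<bmc-of-001db01b> login=<user> passwd=<secret>
# pcs property set stonith-enabled=true

Since a node without quorum refuses to fence (your log says "Fencing and
resource management disabled due to lack of quorum"), the qdevice is what
prevents a fence war: only the side that keeps quorum gets to shoot.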

> More importantly, why did the failure of resource p_mysql_622 keep the whole cluster from recovering? 

The resources on 001db01b stayed up as far as I can tell, so "the whole
cluster" is an exaggeration. 001db01a tried to stop its resources because
of the loss of quorum:

Feb 22 05:16:30 [91682] 001db01a    pengine:   notice: LogAction:   * Stop       p_fs_clust01     (                 001db01a )   due to no quorum
Feb 22 05:16:30 [91682] 001db01a    pengine:   notice: LogAction:   * Stop       p_mysql_001      (                 001db01a )   due to no quorum
Feb 22 05:16:30 [91682] 001db01a    pengine:   notice: LogAction:   * Stop       p_mysql_000      (                 001db01a )   due to no quorum
Feb 22 05:16:30 [91682] 001db01a    pengine:   notice: LogAction:   * Stop       p_mysql_002      (                 001db01a )   due to no quorum
Feb 22 05:16:30 [91682] 001db01a    pengine:   notice: LogAction:   * Stop       p_mysql_003      (                 001db01a )   due to no quorum
Feb 22 05:16:30 [91682] 001db01a    pengine:   notice: LogAction:   * Stop       p_mysql_004      (                 001db01a )   due to no quorum
Feb 22 05:16:30 [91682] 001db01a    pengine:   notice: LogAction:   * Stop       p_mysql_005      (                 001db01a )   due to no quorum
Feb 22 05:16:30 [91682] 001db01a    pengine:   notice: LogAction:   * Stop       p_mysql_622      (                 001db01a )   due to no quorum
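
That is the default no-quorum-policy=stop behavior. If stopping everything
on the non-quorate side is not what you want, the policy can be changed,
for example (whether freeze is safe depends on the workload, and it really
wants working fencing behind it):

# pcs property set no-quorum-policy=freeze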


> As soon as I did 'pcs resource cleanup p_mysql_622' all the other resources recovered, but none of them are dependent on that resource.
> 

The logs do not contain entries for this, but my guess is that p_mysql_622
depends on p_fs_clust01, and the failure to stop p_mysql_622 blocked further
actions on p_fs_clust01, so the resources remained stopped:

Feb 22 05:16:38 [112948] 001db01b    pengine:   notice: LogAction:   * Stop       p_fs_clust01     (                 001db01a )   blocked
Feb 22 05:16:38 [112948] 001db01b    pengine:   notice: LogAction:   * Start      p_mysql_001      (                 001db01b )   due to colocation with p_fs_clust01 (blocked)
Feb 22 05:16:38 [112948] 001db01b    pengine:   notice: LogAction:   * Start      p_mysql_000      (                 001db01b )   due to colocation with p_fs_clust01 (blocked)
Feb 22 05:16:38 [112948] 001db01b    pengine:   notice: LogAction:   * Start      p_mysql_002      (                 001db01b )   due to colocation with p_fs_clust01 (blocked)
Feb 22 05:16:38 [112948] 001db01b    pengine:   notice: LogAction:   * Start      p_mysql_003      (                 001db01b )   due to colocation with p_fs_clust01 (blocked)
Feb 22 05:16:38 [112948] 001db01b    pengine:   notice: LogAction:   * Start      p_mysql_004      (                 001db01b )   due to colocation with p_fs_clust01 (blocked)
Feb 22 05:16:38 [112948] 001db01b    pengine:   notice: LogAction:   * Start      p_mysql_005      (                 001db01b )   due to colocation with p_fs_clust01 (blocked)
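
If that guess is right, the constraints involved presumably look something
like this (resource names taken from your logs; the exact syntax in your
configuration may differ, or the resources may be in a group, which implies
the same ordering and colocation):

# pcs constraint order start p_fs_clust01 then start p_mysql_622
# pcs constraint colocation add p_mysql_622 with p_fs_clust01 INFINITY

With no fencing there is nothing Pacemaker can do about a failed stop except
block the resource, so the stop of p_fs_clust01 stayed blocked and everything
colocated with it stayed down until the failed action was cleared, which
matches 'pcs resource cleanup p_mysql_622' bringing everything back.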



