[ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?

Fri Feb 26 13:58:44 EST 2021

> -----Original Message-----
> From: Users <users-bounces at clusterlabs.org> On Behalf Of Andrei
> Borzenkov
> Sent: Friday, February 26, 2021 11:27 AM
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
> Down Anyway?
>
> 26.02.2021 19:19, Eric Robinson wrote:
> > At 5:16 am Pacific time Monday, one of our cluster nodes failed and its mysql services went down. The cluster did not automatically recover.
> >
> > We're trying to figure out:
> >
> >
> >   1.  Why did it fail?
>
> Pacemaker only registered a loss of connection between the two nodes. You
> need to investigate why it happened.
>
> >   2.  Why did it not automatically recover?
> >
> > The cluster did not recover until we manually executed...
> >
>
> The *cluster* never failed in the first place. A specific resource may have.
> Do not confuse things more than is necessary.
>
> > # pcs resource cleanup p_mysql_622
> >
>
> Because this resource failed to stop, and a stop failure is fatal.
>
> > Feb 22 05:16:30 [91682] 001db01a    pengine:   notice: LogAction:        * Stop p_mysql_622      (                 001db01a )   due to no quorum
>
> The remaining node lost quorum and decided to stop its resources.
>

I consider this a cluster failure, exacerbated by a resource failure. We can investigate why resource p_mysql_622 failed to stop, but it seems the underlying problem is the loss of quorum. That should not have happened with the qdevice in the mix, should it?

I'm confused about what is supposed to happen here. If the root cause is that node 001db01a briefly lost all communication with the network (just guessing), then it should have taken no action, including STONITH, since there would be no quorum. (There is no physical STONITH device anyway, as both nodes are in Azure.) Meanwhile, node 001db01b would still have had quorum (itself plus the qdevice), and should have assumed ownership of the resources and started them, or no?
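
To make that expectation concrete, here is a minimal sketch of the vote arithmetic. It assumes one vote per node plus one vote from the qdevice (three expected votes) and a simple-majority rule; the function names are made up for illustration, and this is not how corosync itself is implemented.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Simple majority: strictly more than half of the expected votes."""
    return expected_votes // 2 + 1


def has_quorum(node_votes: int, qdevice_votes: int, expected_votes: int) -> bool:
    """True if the votes visible in this partition reach the majority threshold."""
    return node_votes + qdevice_votes >= quorum_threshold(expected_votes)


# Assumption: 2 nodes + 1 qdevice, one vote each.
EXPECTED_VOTES = 3

# Votes each node can see in three hypothetical partitions.
scenarios = {
    "001db01a cut off from peer and qdevice": has_quorum(1, 0, EXPECTED_VOTES),
    "001db01b still reaching the qdevice": has_quorum(1, 1, EXPECTED_VOTES),
    "001db01b also cut off from the qdevice": has_quorum(1, 0, EXPECTED_VOTES),
}

for name, quorate in scenarios.items():
    print(f"{name}: {'quorate' if quorate else 'no quorum'}")
```

Under these assumptions a node keeps quorum only while it can still reach the qdevice, so the question is whether 001db01b could reach it at the time.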

> > Feb 22 05:16:30 [91683] 001db01a       crmd:   notice: te_rsc_command: Initiating stop operation p_mysql_622_stop_0 locally on 001db01a | action 76
> ...
> > Feb 22 05:16:30 [91680] 001db01a       lrmd:     info: log_execute:     executing - rsc:p_mysql_622 action:stop call_id:308
> ...
> > Feb 22 05:16:45 [91680] 001db01a       lrmd:  warning: child_timeout_callback:  p_mysql_622_stop_0 process (PID 19225) timed out
> > Feb 22 05:16:45 [91680] 001db01a       lrmd:  warning: operation_finished: p_mysql_622_stop_0:19225 - timed out after 15000ms
> > Feb 22 05:16:45 [91680] 001db01a       lrmd:     info: log_finished:    finished - rsc:p_mysql_622 action:stop call_id:308 pid:19225 exit-code:1 exec-time:15002ms queue-time:0ms
> > Feb 22 05:16:45 [91683] 001db01a       crmd:    error: process_lrm_event: Result of stop operation for p_mysql_622 on 001db01a: Timed Out | call=308 key=p_mysql_622_stop_0 timeout=15000ms
> ...
> > Feb 22 05:16:38 [112948] 001db01b    pengine:     info: LogActions:     Leave p_mysql_622     (Started unmanaged)
>
> At this point Pacemaker stops managing this resource because its status is
> unknown. The normal reaction to a stop failure is to fence the node and fail
> the resource over, but apparently you also do not have working stonith.
>
> The loss of quorum may be related to a network issue in which both nodes
> lost their connection to each other and to the quorum device.
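
The stop-failure handling described in the quoted reply can be sketched as a small decision function. This is a simplified model, assuming Pacemaker's default on-fail behaviour for stop operations (fence when stonith is enabled, block otherwise); the function name is hypothetical, and a real cluster may override the default per operation.

```python
def stop_failure_response(stonith_enabled: bool) -> str:
    """Model of the default reaction to a failed stop operation.

    Assumption: no explicit on-fail is configured for the stop operation.
    """
    if stonith_enabled:
        # The node is fenced, after which the resource can be recovered elsewhere.
        return "fence"
    # Without fencing the resource state is unknown, so the cluster leaves the
    # resource alone until an administrator cleans it up.
    return "block"


for enabled in (True, False):
    print(f"stonith enabled={enabled}: {stop_failure_response(enabled)}")
```

In this model the "block" case is consistent with the resource staying unmanaged until `pcs resource cleanup p_mysql_622` was run by hand.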