[ClusterLabs] [PaceMaker] Help troubleshooting frequent disjoin issue
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Tue Mar 21 06:00:33 EDT 2023
Hi,
On Tue, 21 Mar 2023 09:33:04 +0100
Jérôme BECOT <jerome.becot at deveryware.com> wrote:
> We have several clusters running for different zabbix components. Some
> of these clusters consist of 2 zabbix proxies, where nodes run MySQL,
> Zabbix-proxy server and a VIP, and a corosync-qdevice.
I'm not sure I understand your topology. The corosync-qdevice daemon is not
supposed to run on a cluster node. It is supposed to run on a remote host and
provide quorum arbitration to one or more clusters without requiring the whole
pacemaker/corosync stack there.
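For reference, here is a minimal sketch of the usual split: corosync-qnetd runs
on a third, non-cluster host, and each cluster node only carries a client-side
"device" section in corosync.conf (the host name and algorithm below are
assumptions, adjust to your setup):

```
quorum {
    provider: corosync_votequorum
    device {
        model: net
        net {
            # remote host running corosync-qnetd, outside the cluster
            host: qnetd-arbiter.example.com
            # ffsplit is the usual choice for two-node clusters
            algorithm: ffsplit
        }
    }
}
```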
> The MySQL servers are always up to replicate, and are configured in
> Master/Master (they both replicate from the other but only one is supposed to
> be updated by the proxy running on the master node).
Why do you bother with Master/Master when a simple (I suppose, I'm not a MySQL
cluster guy) Primary/Secondary topology, or even shared storage, would be
enough and would keep your logic (writes on one node only) safe from incidents,
failures, errors, etc.?
HA must be as simple as possible. Remove useless parts when you can.
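For what it's worth, a one-directional asynchronous setup on the secondary
could be as small as the following (the host, user and GTID assumption are made
up for illustration; this is a sketch, not your configuration):

```sql
-- Hypothetical sketch: replicate from the primary only, instead of a
-- circular Master/Master setup (host and user names are placeholders).
CHANGE MASTER TO
  MASTER_HOST = 'zabbix-proxy-01',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'secret',
  MASTER_AUTO_POSITION = 1;  -- assumes GTID-based replication is enabled
START SLAVE;
```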
> One cluster is prone to frequent sync errors, with duplicate-entry
> errors in SQL. When I look at the logs, I can see "Mar 21 09:11:41
> zabbix-proxy-01 pacemaker-controld [948] (pcmk_cpg_membership)
> info: Group crmd event 89: zabbix-proxy-02 (node 2 pid 967) left via
> cluster exit", and within the next second, a rejoin. The same messages
> are in the other node logs, suggesting a split brain, which should not
> happen, because there is a quorum device.
Could it be that your SQL sync errors and the leave/join events are correlated,
both being symptoms of some other failure? Look at your logs for an
explanation of why the node decided to leave the cluster.
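As a starting point, this is the kind of filter worth running over both nodes'
logs around the disjoin timestamp. It is shown here on an inline excerpt from
your mail so it is self-contained; in practice you would feed it something
like the output of "journalctl -u corosync -u pacemaker" (unit names and
patterns are assumptions):

```shell
# Sketch: keep only membership/quorum-related lines around the event.
excerpt='Mar 21 09:11:41 zabbix-proxy-01 pacemaker-controld[948] (pcmk_cpg_membership) info: Group crmd event 89: zabbix-proxy-02 (node 2 pid 967) left via cluster exit'
printf '%s\n' "$excerpt" | grep -Ei 'token|membership|left|join|quorum'
```

Lines mentioning token timeouts or fencing just before the "left via cluster
exit" message are usually the real explanation.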
> Can you help me to troubleshoot this ? I can provide any
> log/configuration required in the process, so let me know.
>
> I'd also like to ask if there is a bit of configuration that can be done
> to postpone service start on the other node for two or three seconds as
> a quick workaround ?
How would it be a workaround?
Regards,
More information about the Users mailing list