[ClusterLabs] [Pacemaker] Help troubleshooting frequent disjoin issue

Jérôme BECOT jerome.becot at deveryware.com
Tue Mar 21 06:47:23 EDT 2023


On 21/03/2023 at 11:00, Jehan-Guillaume de Rorthais wrote:
> Hi,
>
> On Tue, 21 Mar 2023 09:33:04 +0100
> Jérôme BECOT <jerome.becot at deveryware.com> wrote:
>
>> We have several clusters running different Zabbix components. Some of
>> these clusters consist of 2 Zabbix proxies, where nodes run MySQL, the
>> Zabbix proxy server and a VIP, plus a corosync-qdevice.
> I'm not sure I understand your topology. The corosync-qdevice is not
> supposed to be on a cluster node. It is supposed to be on a remote node and
> provide quorum features to one or more clusters without setting up the
> whole pacemaker/corosync stack.
I was not clear, the qdevice is deployed on a remote node, as intended.
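For reference, the quorum section on the cluster nodes looks roughly like
this (a sketch from memory; the qnetd host name is a placeholder, not our
actual address):

    quorum {
        provider: corosync_votequorum
        device {
            model: net
            votes: 1
            net {
                # corosync-qnetd runs on this remote third machine
                # (placeholder name)
                host: qnetd.example.com
                algorithm: ffsplit
            }
        }
    }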
>
>> The MySQL servers are always up to replicate, and are configured in
>> Master/Master (they both replicate from the other but only one is supposed to
>> be updated by the proxy running on the master node).
> Why do you bother with Master/Master when a simple (I suppose; I'm not a
> MySQL cluster guy) Primary/Secondary topology, or even shared storage,
> would be enough and would keep your logic (writes on one node only) safe
> from incidents, failures, errors, etc.?
>
HA must be as simple as possible. Remove useless parts when you can.
Shared storage moves the complexity somewhere else. A classic
Primary/Secondary setup can be an option if Pacemaker manages to start the
client on the slave node, but it would still become Master/Master during a
split brain.
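For the record, if we went the Primary/Secondary route, I imagine
constraints along these lines would tie the client to the primary. A rough,
untested sketch; the resource names (mysql-clone, vip, zabbix-proxy) are
made up for illustration:

    # Keep the VIP on the node holding the promoted MySQL instance
    # ("Promoted" is spelled "master" on older pcs versions):
    pcs constraint colocation add vip with Promoted mysql-clone INFINITY
    pcs constraint order promote mysql-clone then start vip
    # Run the proxy wherever the VIP is:
    pcs constraint colocation add zabbix-proxy with vip INFINITY
    pcs constraint order start vip then start zabbix-proxy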
>
>> One cluster is prone to frequent sync errors, with duplicate entry
>> errors in SQL. When I look at the logs, I can see "Mar 21 09:11:41
>> zabbix-proxy-01 pacemaker-controld  [948] (pcmk_cpg_membership)
>> info: Group crmd event 89: zabbix-proxy-02 (node 2 pid 967) left via
>> cluster exit", and within the next second, a rejoin. The same messages
>> are in the other node's logs, suggesting a split brain, which should not
>> happen, because there is a quorum device.
> Could it be that your SQL sync errors and the leave/join events are
> correlated and are both symptoms of another failure? Look at your logs for
> an explanation of why the node decided to leave the cluster.

My guess is that high network latency causes the disjoin; Zabbix-proxy then
starts on both nodes, which causes the replication errors. The proxy is
configured to use the VIP, which is up locally on both nodes because of the
split brain.

This is why I'm asking for guidance on how to check/monitor these nodes, to
find out whether transient network latency is causing the disjoin.
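Concretely, here is what I plan to run on both nodes (standard
corosync/knet tooling plus a crude latency log; the peer IP below is a
placeholder):

    # Ring/link status as corosync sees it
    corosync-cfgtool -s

    # Configured and runtime totem timeouts (token, consensus, ...)
    corosync-cmapctl | grep -i totem

    # Membership changes over the last hours
    journalctl -u corosync -u pacemaker --since "6 hours ago" \
        | grep -Ei 'membership|left|join'

    # Continuous latency log against the peer node (-D adds timestamps)
    ping -i 1 -D 10.0.0.2 >> /var/log/peer-latency.log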

>
>> Can you help me troubleshoot this? I can provide any
>> log/configuration required in the process, so let me know.
>>
>> I'd also like to ask if there is a bit of configuration that can be done
>> to postpone service start on the other node for two or three seconds as
>> a quick workaround?
> How would it be a workaround?
Because if the network issues persist, the proxy would not be started on the
slave node, as the disjoin lasts less than two seconds. Fixing the network is
the real solution (but not in my power); delaying the service start looks
like a decent workaround to me in the meantime.
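The closest knob I have found so far is the totem token timeout in
corosync.conf: raising it should make corosync tolerate short latency spikes
instead of declaring the peer dead. A sketch, with values that are guesses
for our environment, not validated:

    totem {
        # Default token timeout is 1000 ms; raising it makes corosync
        # wait longer before declaring a node dead.
        token: 5000
        # consensus defaults to 1.2 * token; set it explicitly if needed
        consensus: 6000
    }

If I understand correctly, corosync-cfgtool -R should reload the
configuration on all nodes without a restart, but I have not verified that.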
>
> Regards,
-- 
*Jérôme BECOT*
DevOps Infrastructure Engineer

Landline: 01 82 28 37 06
Mobile: +33 757 173 193
Deveryware - 43 rue Taitbout - 75009 PARIS
https://www.deveryware.com