[ClusterLabs] [PaceMaker] Help troubleshooting frequent disjoin issue
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Tue Mar 21 06:00:33 EDT 2023
Hi,
On Tue, 21 Mar 2023 09:33:04 +0100
Jérôme BECOT <jerome.becot at deveryware.com> wrote:
> We have several clusters running for different zabbix components. Some
> of these clusters consist of 2 zabbix proxies, where nodes run MySQL,
> Zabbix-proxy server and a VIP, and a corosync-qdevice.
I'm not sure I understand your topology. The corosync-qdevice daemon is not
supposed to run on a cluster node. It is supposed to run on a remote host and
provide quorum arbitration to one or more clusters without requiring the whole
pacemaker/corosync stack there.
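For reference, here is a minimal sketch of the usual split: corosync-qnetd runs
on a third, non-cluster host, and each cluster node only carries a client-side
"device" section in corosync.conf (the host name and algorithm below are
assumptions, adjust to your setup):

```
quorum {
    provider: corosync_votequorum
    device {
        model: net
        net {
            # remote host running corosync-qnetd, outside the cluster
            host: qnetd-arbiter.example.com
            # ffsplit is the usual choice for two-node clusters
            algorithm: ffsplit
        }
    }
}
```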
> The MySQL servers are always up to replicate, and are configured in
> Master/Master (they both replicate from the other but only one is supposed to
> be updated by the proxy running on the master node).
Why do you bother with Master/Master when a simple (I suppose, I'm not a MySQL
cluster guy) Primary/Secondary topology, or even shared storage, would be
enough and would keep your logic (writes on one node only) safe from incidents,
failures, errors, etc.?
HA must be as simple as possible. Remove useless parts when you can.
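For what it's worth, a one-directional asynchronous setup on the secondary
could be as small as the following (the host, user and GTID assumption are made
up for illustration; this is a sketch, not your configuration):

```sql
-- Hypothetical sketch: replicate from the primary only, instead of a
-- circular Master/Master setup (host and user names are placeholders).
CHANGE MASTER TO
  MASTER_HOST = 'zabbix-proxy-01',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'secret',
  MASTER_AUTO_POSITION = 1;  -- assumes GTID-based replication is enabled
START SLAVE;
```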
> One cluster is prone to frequent sync errors, with duplicate-entry
> errors in SQL. When I look at the logs, I can see "Mar 21 09:11:41
> zabbix-proxy-01 pacemaker-controld [948] (pcmk_cpg_membership)
> info: Group crmd event 89: zabbix-proxy-02 (node 2 pid 967) left via
> cluster exit", and within the next second, a rejoin. The same messages
> are in the other node logs, suggesting a split brain, which should not
> happen, because there is a quorum device.
Could it be that your SQL sync errors and the leave/join events are correlated,
both being symptoms of some other failure? Look at your logs for an
explanation of why the node decided to leave the cluster.
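As a starting point, this is the kind of filter worth running over both nodes'
logs around the disjoin timestamp. It is shown here on an inline excerpt from
your mail so it is self-contained; in practice you would feed it something
like the output of "journalctl -u corosync -u pacemaker" (unit names and
patterns are assumptions):

```shell
# Sketch: keep only membership/quorum-related lines around the event.
excerpt='Mar 21 09:11:41 zabbix-proxy-01 pacemaker-controld[948] (pcmk_cpg_membership) info: Group crmd event 89: zabbix-proxy-02 (node 2 pid 967) left via cluster exit'
printf '%s\n' "$excerpt" | grep -Ei 'token|membership|left|join|quorum'
```

Lines mentioning token timeouts or fencing just before the "left via cluster
exit" message are usually the real explanation.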
> Can you help me to troubleshoot this ? I can provide any
> log/configuration required in the process, so let me know.
>
> I'd also like to ask if there is a bit of configuration that can be done
> to postpone service start on the other node for two or three seconds as
> a quick workaround ?
How would it be a workaround?
Regards,
More information about the Users mailing list