[ClusterLabs] [PaceMaker] Help troubleshooting frequent disjoin issue
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Tue Mar 21 07:01:48 EDT 2023
On Tue, 21 Mar 2023 11:47:23 +0100
Jérôme BECOT <jerome.becot at deveryware.com> wrote:
> On 21/03/2023 at 11:00, Jehan-Guillaume de Rorthais wrote:
> > Hi,
> >
> > On Tue, 21 Mar 2023 09:33:04 +0100
> > Jérôme BECOT <jerome.becot at deveryware.com> wrote:
> >
> >> We have several clusters running for different Zabbix components. Some
> >> of these clusters consist of 2 Zabbix proxies, where nodes run MySQL,
> >> Zabbix-proxy server and a VIP, and a corosync-qdevice.
> > I'm not sure I understand your topology. The corosync-qdevice is not
> > supposed to be on a cluster node. It is supposed to be on a remote node and
> > provide some quorum features to one or more clusters without setting up the
> > whole pacemaker/corosync stack.
> I was not clear, the qdevice is deployed on a remote node, as intended.
ok
> >> The MySQL servers are always up so they can replicate, and are configured
> >> in Master/Master (each replicates from the other, but only one is supposed
> >> to be updated, by the proxy running on the master node).
> > Why do you bother with Master/Master when a simple (I suppose, I'm not a
> > MySQL cluster guy) Primary-Secondary topology or even a shared storage
> > would be enough and would keep your logic (writes on one node only) safe
> > from incidents, failures, errors, etc?
> >
> > HA must be as simple as possible. Remove useless parts when you can.
> A shared storage moves the complexity somewhere else.
Yes, to the storage/SAN side.
> A classic Primary / secondary can be an option if PaceMaker manages to start
> the client on the slave node,
I suppose this can be done using a location constraint.
> but it would become Master/Master during the split brain.
No, and if you do have a real split brain, then you might have something wrong
in your setup. See below.
> >> One cluster is prone to frequent sync errors, with duplicate entry
> >> errors in SQL. When I look at the logs, I can see "Mar 21 09:11:41
> >> zabbix-proxy-01 pacemaker-controld [948] (pcmk_cpg_membership)
> >> info: Group crmd event 89: zabbix-proxy-02 (node 2 pid 967) left via
> >> cluster exit", and within the next second, a rejoin. The same messages
> >> are in the other node logs, suggesting a split brain, which should not
> >> happen, because there is a quorum device.
> > Is it possible that your SQL sync errors and the leave/join issues are
> > correlated, and are both symptoms of another failure? Look at your logs for
> > some explanation about why the node decided to leave the cluster.
>
> My guess is that high network latency may cause the disjoin, and that
> Zabbix-proxy then starting on both nodes causes the replication error.
> It is configured to use the VIP, which is up locally because there is a
> split brain.
If you have a split brain, that means your quorum setup is failing.
No node can start/promote a resource without having quorum. If a node is
isolated from the cluster and the quorum-device, it should stop its resources,
not recover/promote them.
If both nodes lose their connection to each other, but are still connected to
the quorum-device, the latter should be able to grant quorum to one side only.
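
As a rough illustration of the vote arithmetic behind this, here is a minimal
sketch. It is a simplified model and not corosync's actual votequorum/qdevice
implementation: it assumes each of the two nodes and the quorum-device carries
exactly one vote, and the function name `has_quorum` is made up for the
example.

```python
# Simplified model of quorum in a 2-node cluster with a quorum device.
# Assumption: each node and the qdevice carry one vote (3 expected votes).
# Real corosync votequorum/qdevice behaviour is more involved.

def has_quorum(visible_nodes: int, qdevice_vote: bool, expected_votes: int = 3) -> bool:
    """Return True if a partition holds a strict majority of the expected votes."""
    votes = visible_nodes + (1 if qdevice_vote else 0)
    return votes > expected_votes // 2

# Healthy cluster: both nodes see each other and the qdevice.
print(has_quorum(2, True))    # True

# The nodes lose each other but both still reach the qdevice:
# the qdevice gives its vote to one side only.
print(has_quorum(1, True))    # True  (this side may keep/start resources)
print(has_quorum(1, False))   # False (this side must stop its resources)
```

A node cut off from both its peer and the qdevice is the same `(1, False)`
case: one vote out of three, so no quorum and no resources.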
Lastly, quorum is a split brain protection when "things are going fine".
Fencing is a split brain protection for all other situations. Fencing is hard
and painful, but it saves you from many split brain situations.
> This is why I'm requesting guidance to check/monitor these nodes to find
> out if it is temporary network latency that is causing the disjoin.
A cluster is always very sensitive to network latency/failures. You need to
build on stronger foundations.
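
As a starting point for checking the logs, something like the sketch below
could list every "left" event so the timestamps can be compared with network
monitoring from the same moments. The pattern is modeled only on the single
log line quoted earlier in this thread, so other Pacemaker versions may format
the message differently. The function name and the sample data are made up for
the example.

```python
import re
from collections import Counter

# Pattern modeled on the one log line quoted in this thread (assumption:
# other Pacemaker versions or log setups may use a different layout).
LEFT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"pacemaker-controld\s+\[\d+\]\s+\(pcmk_cpg_membership\)\s+"
    r"info:\s+Group\s+(?P<group>\S+)\s+event\s+\d+:\s+"
    r"(?P<peer>\S+)\s+\(node\s+\d+\s+pid\s+\d+\)\s+left\s+via\s+(?P<reason>.+)$"
)

def find_leave_events(lines):
    """Return (timestamp, peer, reason) for each 'left via ...' membership line."""
    events = []
    for line in lines:
        m = LEFT_RE.match(line.strip())
        if m:
            events.append((m["ts"], m["peer"], m["reason"]))
    return events

# Sample input: the first line is the message quoted above, the second is a
# made-up unrelated line that must be ignored.
sample = [
    "Mar 21 09:11:41 zabbix-proxy-01 pacemaker-controld [948] "
    "(pcmk_cpg_membership) info: Group crmd event 89: zabbix-proxy-02 "
    "(node 2 pid 967) left via cluster exit",
    "Mar 21 09:11:42 zabbix-proxy-01 some other unrelated log line",
]

events = find_leave_events(sample)
for ts, peer, reason in events:
    print(ts, peer, reason)

# How often each peer was seen leaving.
print(Counter(peer for _, peer, _ in events))
```

To run it on a real node, pass an open log file instead of `sample`; the
location of the Pacemaker log depends on the distribution and configuration.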
Regards,