<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p><br>
</p>
<div class="moz-cite-prefix">On 21/03/2023 at 11:00, Jehan-Guillaume
de Rorthais wrote:<br>
</div>
<blockquote type="cite" cite="mid:20230321110033.5f0df130@karst">
<pre class="moz-quote-pre" wrap="">Hi,
On Tue, 21 Mar 2023 09:33:04 +0100
Jérôme BECOT <a class="moz-txt-link-rfc2396E" href="mailto:jerome.becot@deveryware.com">&lt;jerome.becot@deveryware.com&gt;</a> wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">We have several clusters running for different Zabbix components. Some
of these clusters consist of 2 Zabbix proxies, where nodes run MySQL,
Zabbix-proxy server and a VIP, and a corosync-qdevice.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
I'm not sure I understand your topology. The corosync-qdevice is not supposed
to be on a cluster node. It is supposed to be on a remote node and provide some
quorum features to one or more clusters without setting up the whole
pacemaker/corosync stack.</pre>
</blockquote>
I was not clear: the qdevice is deployed on a remote node, as
intended.<br>
<blockquote type="cite" cite="mid:20230321110033.5f0df130@karst">
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">The MySQL servers are always up so that they can replicate, and are configured in
Master/Master (each replicates from the other, but only one is supposed to
be updated, by the proxy running on the master node).
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Why do you bother with Master/Master when a simple (I suppose, I'm not a MySQL
cluster guy) Primary-Secondary topology or even a shared storage would be
enough and would keep your logic (writes on one node only) safe from incidents,
failures, errors, etc?
HA must be as simple as possible. Remove useless parts when you can.</pre>
</blockquote>
A shared storage moves the complexity somewhere else. A classic
Primary/Secondary setup can be an option if Pacemaker manages to start
the client on the slave node, but it would become Master/Master
during the split brain.<br>
<blockquote type="cite" cite="mid:20230321110033.5f0df130@karst">
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">One cluster is prone to frequent sync errors, with duplicate entry
errors in SQL. When I look at the logs, I can see "Mar 21 09:11:41
zabbix-proxy-01 pacemaker-controld [948] (pcmk_cpg_membership)
info: Group crmd event 89: zabbix-proxy-02 (node 2 pid 967) left via
cluster exit", and within the next second, a rejoin. The same messages
are in the other node logs, suggesting a split brain, which should not
happen, because there is a quorum device.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Is it possible that your SQL sync errors and the leave/join issues are
correlated and are both symptoms of another failure? Look at your logs for some
explanation of why the node decided to leave the cluster.</pre>
</blockquote>
<p>My guess is that high network latency causes the
disjoin, and that Zabbix-proxy is then started on both nodes, which causes the
replication errors. The proxy is configured to use the VIP, which is up
locally because there is a split brain.</p>
<p>This is why I'm requesting guidance to check/monitor these nodes
to find out if it is temporary network latency that is causing the
disjoin.<br>
</p>
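<p>As a starting point for that check, here is a minimal sketch in
Python that extracts the "left via cluster exit" events from a
Pacemaker log. It only assumes the format of the single log line
quoted above, so the pattern may need adjusting for other versions.
The name <code>find_leave_events</code> and the <code>year</code>
parameter (syslog timestamps carry no year) are illustrative, not
part of any Pacemaker tooling.</p>
<pre>
```python
import re
from datetime import datetime

# Pattern derived from the one log line quoted in this thread (assumption:
# other Pacemaker versions may word or space the message differently).
LEAVE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+pacemaker-controld\b.*?"
    r"Group crmd event \d+:\s+(\S+)\s+\(node \d+ pid \d+\)\s+left via\s+cluster exit"
)


def find_leave_events(lines, year=2023):
    """Return a list of (timestamp, reporting_host, departed_peer) tuples."""
    events = []
    for line in lines:
        match = LEAVE_RE.search(line)
        if match is None:
            continue  # not a membership "left" message
        stamp, host, peer = match.groups()
        # The log timestamp has no year, so the caller supplies one.
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        events.append((when, host, peer))
    return events
```
</pre>
<p>The resulting timestamps could then be compared with those of the
duplicate entry errors in the MySQL logs, and with any latency
measurements, to see whether they line up.</p>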
<blockquote type="cite" cite="mid:20230321110033.5f0df130@karst">
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Can you help me troubleshoot this? I can provide any
log/configuration required in the process, so let me know.
I'd also like to ask if there is a bit of configuration that can be done
to postpone service start on the other node for two or three seconds as
a quick workaround?
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
How would it be a workaround?</pre>
</blockquote>
Because even if the network issues persist, the proxy would not be started on
the slave node, as the disjoin lasts less than two seconds.
Fixing the network is the real solution (but that is not in my power), so delaying
the service start looks like a decent workaround to
me.<br>
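<p>To make the behaviour I am after concrete, here is a minimal sketch
in Python. It is not a Pacemaker or Corosync setting, only an
illustration of the logic: the standby node takes over only once the
peer has been missing for a whole grace period. The function name,
its arguments and the three second default are all hypothetical.</p>
<pre>
```python
def should_take_over(peer_lost_at, now, grace_seconds=3.0):
    """Decide whether the standby node may start the service.

    peer_lost_at: time (in seconds) at which the peer left, or None if
                  the peer is currently a cluster member.
    now:          current time, in the same unit.
    """
    if peer_lost_at is None:
        return False  # peer is present, nothing to take over
    # Only react once the peer has been gone for the full grace period.
    return (now - peer_lost_at) >= grace_seconds
```
</pre>
<p>With a grace period of three seconds, a disjoin that heals in less
than two seconds would never trigger a start on the slave node.</p>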
<blockquote type="cite" cite="mid:20230321110033.5f0df130@karst">
<pre class="moz-quote-pre" wrap="">
Regards,
</pre>
</blockquote>
<div class="moz-signature">-- <br>
<div class="moz-signature"><b><span style="color:#002060">Jérôme
BECOT</span></b> <span style="color:#002060"></span><br>
<span style="color:#002060">DevOps Infrastructure Engineer </span><br>
<br>
<span style="color:#002060">Landline: </span> <span
style="color:#002060;mso-fareast-language:FR">01 82 28 37 06</span><br>
<span style="color:#002060">Mobile: +33 757 173 193</span><br>
<span style="color:#002060">Deveryware - 43 rue Taitbout - 75009
PARIS</span><br>
<a moz-do-not-send="true" href="https://www.deveryware.com"> <span
style="color:#002060"><span style="color:#002060">
https://www.deveryware.com</span></span></a></div>
<div class="moz-signature"> <span
style="color:#002060;mso-fareast-language:FR"></span><br>
<img moz-do-not-send="false"
src="cid:part1.OjO7RCO0.zpV0gi8l@deveryware.com"
alt="Deveryware_Logo" width="402" height="107"><br>
</div>
</div>
</body>
</html>