[ClusterLabs] [PaceMaker] Help troubleshooting frequent disjoin issue
Jérôme BECOT
jerome.becot at deveryware.com
Wed Mar 22 13:17:41 EDT 2023
Thanks for your advice. As I suspected, the zabbix-proxy service was
started on the backup node today at 16:46:
Mar 22 16:46:52 zabbix-proxy-dc1-01 pacemaker-attrd [946]
(pcmk_cpg_membership) info: Group attrd event 95:
zabbix-proxy-dc2-01 (node 2 pid 965) left via cluster exit
Mar 22 16:46:52 zabbix-proxy-dc1-01 pacemaker-attrd [946]
(crm_update_peer_proc) info: pcmk_cpg_membership: Node
zabbix-proxy-dc2-01[2] - corosync-cpg is now offline
Mar 22 16:46:52 zabbix-proxy-dc1-01 pacemaker-based [942]
(pcmk_cpg_membership) info: Group cib event 96: zabbix-proxy-dc2-01
(node 2 pid 962) left via cluster exit
Mar 22 16:46:52 zabbix-proxy-dc1-01 pacemaker-attrd [946]
(attrd_remove_voter) notice: Lost attribute writer zabbix-proxy-dc2-01
Mar 22 16:46:52 zabbix-proxy-dc1-01 pacemaker-based [942]
(crm_update_peer_proc) info: pcmk_cpg_membership: Node
zabbix-proxy-dc2-01[2] - corosync-cpg is now offline
Mar 22 16:46:52 zabbix-proxy-dc1-01 pacemaker-attrd [946]
(attrd_start_election_if_needed) info: Starting an election to
determine the writer
Mar 22 16:46:52 zabbix-proxy-dc1-01 pacemaker-attrd [946]
(crm_cs_flush) info: Sent 0 CPG messages (1 remaining, last=262):
Try again (6)
Mar 22 16:46:52 zabbix-proxy-dc1-01 pacemaker-based [942]
(crm_update_peer_state_iter) notice: Node zabbix-proxy-dc2-01 state
is now lost | nodeid=2 previous=member source=crm_update_peer_proc
Mar 22 16:46:52 zabbix-proxy-dc1-01 pacemaker-based [942]
(reap_crm_member) notice: Purged 1 peer with id=2 and/or
uname=zabbix-proxy-dc2-01 from the membership cache
Mar 22 16:46:52 zabbix-proxy-dc1-01 pacemaker-based [942]
(pcmk_cpg_membership) info: Group cib event 96: zabbix-proxy-dc1-01
(node 1 pid 942) is member
Mar 22 16:46:52 zabbix-proxy-dc1-01 pacemaker-attrd [946]
(attrd_peer_remove) notice: Removing all zabbix-proxy-dc2-01
attributes for peer loss
Mar 22 16:46:52 zabbix-proxy-dc1-01 pacemaker-attrd [946]
(crm_reap_dead_member) info: Removing node with name
zabbix-proxy-dc2-01 and id 2 from membership cache
Mar 22 16:46:52 zabbix-proxy-dc1-01 pacemaker-attrd [946]
(reap_crm_member) notice: Purged 1 peer with id=2 and/or
uname=zabbix-proxy-dc2-01 from the membership cache
Mar 22 16:46:53 zabbix-proxy-dc1-01 pacemaker-based [942]
(cib_process_readwrite) info: We are now in R/W mode
Mar 22 16:46:53 zabbix-proxy-dc1-01 pacemaker-based [942]
(cib_process_request) info: Completed cib_master operation for
section 'all': OK (rc=0, origin=local/crmd/971, version=0.70.0)
Mar 22 16:46:53 zabbix-proxy-dc1-01 pacemaker-based [942]
(cib_process_request) info: Forwarding cib_modify operation for
section cib to all (origin=local/crmd/972)
Mar 22 16:46:53 zabbix-proxy-dc1-01 pacemaker-based [942]
(cib_process_request) info: Completed cib_modify operation for
section cib: OK (rc=0, origin=zabbix-proxy-dc1-01/crmd/972, version=0.70.0)
Mar 22 16:46:53 zabbix-proxy-dc1-01 pacemaker-based [942]
(cib_process_request) info: Forwarding cib_modify operation for
section crm_config to all (origin=local/crmd/974)
Mar 22 16:46:53 zabbix-proxy-dc1-01 pacemaker-based [942]
(cib_process_request) info: Completed cib_modify operation for
section crm_config: OK (rc=0, origin=zabbix-proxy-dc1-01/crmd/974,
version=0.70.0)
Mar 22 16:46:53 zabbix-proxy-dc1-01 pacemaker-based [942]
(cib_process_request) info: Forwarding cib_modify operation for
section crm_config to all (origin=local/crmd/976)
Mar 22 16:46:53 zabbix-proxy-dc1-01 pacemaker-based [942]
(cib_process_request) info: Completed cib_modify operation for
section crm_config: OK (rc=0, origin=zabbix-proxy-dc1-01/crmd/976,
version=0.70.0)
Mar 22 16:46:53 zabbix-proxy-dc1-01 pacemaker-controld [948]
(crm_update_peer_join) info: initialize_join: Node
zabbix-proxy-dc1-01[1] - join-31 phase confirmed -> none
Mar 22 16:46:53 zabbix-proxy-dc1-01 pacemaker-controld [948]
(join_make_offer) info: Not making an offer to
zabbix-proxy-dc2-01: not active (lost)
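I also notice the "Try again (6)" from crm_cs_flush right when the peer
drops; if I read it correctly, 6 is CS_ERR_TRY_AGAIN from corosync, i.e. the
CPG layer could not deliver messages while the membership was changing. Next
time it happens I plan to capture the ring state on both nodes with something
like (assuming the standard corosync tooling is the right way to do this):

corosync-cfgtool -s
corosync-quorumtool -s

to see whether the link to the peer really goes down at that moment.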
I think something is wrong with the quorum setup, and I need some help to
figure it out:
pcs quorum status
Quorum information
------------------
Date: Wed Mar 22 18:10:57 2023
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 1
Ring ID: 1/22376
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate Qdevice
Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1    A,V,NMW zabbix-proxy-dc1-01 (local)
         2          1    A,V,NMW zabbix-proxy-dc2-01
         0          1            Qdevice
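If I decode the Qdevice column correctly (A,V,NMW = alive, vote, no
master_wins), the qdevice currently sees both nodes and grants each a vote.
To see what it reports at the moment of a disjoin, I suppose I can also run:

pcs quorum device status
corosync-qdevice-tool -sv

on each node.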
It looks okay to me, but you said that if a node is isolated it should stop
everything, and that is not happening here.
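Is there a Pacemaker property I should double-check as well? I am wondering
about no-quorum-policy: if it were set to "ignore", that would explain why an
isolated node keeps its resources running. If I understand the tooling
correctly, I can query it with:

crm_attribute --type crm_config --name no-quorum-policy --query

(the default is "stop", as far as I can tell from the documentation).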
Thank you
On 21/03/2023 at 12:01, Jehan-Guillaume de Rorthais wrote:
> On Tue, 21 Mar 2023 11:47:23 +0100
> Jérôme BECOT <jerome.becot at deveryware.com> wrote:
>
>> On 21/03/2023 at 11:00, Jehan-Guillaume de Rorthais wrote:
>>> Hi,
>>>
>>> On Tue, 21 Mar 2023 09:33:04 +0100
>>> Jérôme BECOT <jerome.becot at deveryware.com> wrote:
>>>
>>>> We have several clusters running for different Zabbix components. Some
>>>> of these clusters consist of 2 Zabbix proxies, where nodes run MySQL,
>>>> the Zabbix proxy server and a VIP, and a corosync-qdevice.
>>> I'm not sure I understand your topology. The corosync-qdevice is not
>>> supposed to be on a cluster node. It is supposed to be on a remote node and
>>> provide quorum features to one or more clusters without setting up the
>>> whole pacemaker/corosync stack.
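>>> (As a side note, if memory serves, you can list the clusters a qnetd host
>>> serves with "corosync-qnetd-tool -l" on that host.)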
>> I was not clear, the qdevice is deployed on a remote node, as intended.
> ok
>
>>>> The MySQL servers are always up to replicate, and are configured as
>>>> Master/Master (each replicates from the other, but only one is supposed
>>>> to be updated, by the proxy running on the master node).
>>> Why do you bother with Master/Master when a simple (I suppose, I'm not a
>>> MySQL cluster guy) primary/secondary topology or even shared storage
>>> would be enough and would keep your logic (writes on one node only) safe
>>> from incidents, failures, errors, etc.?
>>>
>>> HA must be as simple as possible. Remove useless parts when you can.
>> Shared storage moves the complexity somewhere else.
> Yes, on storage/SAN side.
>
>> A classic primary/secondary setup can be an option if Pacemaker manages to
>> start the client on the slave node,
> I suppose this can be done using a location constraint.
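> Something like this, off the top of my head (untested, resource names made
> up):
>
>   pcs constraint colocation add zabbix-proxy with master mysql-clone INFINITY
>
> Strictly speaking that is a colocation rather than a location constraint,
> but it is what would tie the proxy to the promoted MySQL instance.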
>
>> but it would become Master/Master during a split brain.
> No, and if you do have a real split brain, then you might have something wrong
> in your setup. See below.
>
>
>>>> One cluster is prone to frequent sync errors, with duplicate-entry
>>>> errors in SQL. When I look at the logs, I can see "Mar 21 09:11:41
>>>> zabbix-proxy-01 pacemaker-controld [948] (pcmk_cpg_membership)
>>>> info: Group crmd event 89: zabbix-proxy-02 (node 2 pid 967) left via
>>>> cluster exit", and within the next second, a rejoin. The same messages
>>>> appear in the other node's logs, suggesting a split brain, which should
>>>> not happen, because there is a quorum device.
>>> Could your SQL sync errors and the leave/join issues be correlated,
>>> both symptoms of another failure? Look at your logs for some explanation
>>> of why the node decided to leave the cluster.
>> My guess is that high network latency causes the disjoin; starting
>> Zabbix proxy on both nodes then causes the replication errors. The proxy
>> is configured to use the VIP, which is up locally on each node because of
>> the split brain.
> If you have a split brain, that means your quorum setup is failing.
>
> No node should be able to start/promote a resource without having the quorum.
> If a node is isolated from the cluster and the quorum device, it should stop
> its resources, not recover/promote them.
>
> If both nodes lose connection with each other but are still connected to the
> quorum device, the latter should be able to grant the quorum to one side only.
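>
> For the record, with two nodes the qdevice algorithm matters. The quorum
> section of corosync.conf should look something like this (a sketch; replace
> the qnetd host with yours):
>
>   quorum {
>       provider: corosync_votequorum
>       device {
>           model: net
>           votes: 1
>           net {
>               host: qnetd.example.com   # placeholder, your qnetd host
>               algorithm: ffsplit
>           }
>       }
>   }
>
> With ffsplit, at most one partition can get the qdevice vote, so only one
> side can stay quorate during a split.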
>
> Lastly, quorum is a split-brain protection when "things are going fine".
> Fencing is a split-brain protection for all other situations. Fencing is hard
> and painful, but it saves you from many split-brain situations.
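> (With pcs, once a fence agent is configured, it boils down to something like
> the following — parameter names are illustrative and depend on your
> fence-agents version, check the agent metadata:
>
>   pcs stonith create fence-dc1 fence_ipmilan ip=... username=... password=... pcmk_host_list=zabbix-proxy-dc1-01
>   pcs property set stonith-enabled=true
> )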
>
>> This is why I'm requesting guidance on checking/monitoring these nodes, to
>> find out whether temporary network latency is causing the disjoin.
> A cluster is always very sensitive to network latency/failures. You need to
> build on stronger foundations.
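>
> (If the latency spikes are real, raising the corosync token timeout can buy
> some margin, e.g. totem { token: 5000 } in corosync.conf, at the price of
> slower failure detection. A sketch, not a tuned recommendation for your
> setup.)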
>
> Regards,
--
*Jérôme BECOT*
Ingénieur DevOps Infrastructure
Téléphone fixe: 01 82 28 37 06
Mobile : +33 757 173 193
Deveryware - 43 rue Taitbout - 75009 PARIS
https://www.deveryware.com