[ClusterLabs] Corosync unable to reach consensus for membership
Prasad Nagaraj
prasad.nagaraj76 at gmail.com
Tue Apr 30 07:09:25 EDT 2019
Hello,
I have a 3-node Corosync and Pacemaker cluster with the following nodes:
Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]
Full list of resources:
Master/Slave Set: ms_mysql [p_mysql]
Masters: [ SG-azfw2-189 ]
Slaves: [ SG-azfw2-190 SG-azfw2-191 ]
For my network partition test, I created a firewall rule on node
SG-azfw2-190 to block all incoming UDP traffic from node SG-azfw2-189:
/sbin/iptables -I INPUT -p udp -s 172.19.0.13 -j DROP
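For reference, a minimal sketch of how this test rule could be added, inspected, and removed again. The function names and the IPTABLES override are hypothetical, introduced only for illustration; the rule itself is the one shown above. Note that a rule in the INPUT chain only drops UDP packets arriving at the node where it is installed; packets sent from that node to 172.19.0.13 are not affected by it.

```shell
# Hypothetical helpers around the rule used in this test.
# IPTABLES can be overridden (for example IPTABLES=echo) for a dry run.
IPTABLES="${IPTABLES:-/sbin/iptables}"

# Insert the one-way rule: drop UDP packets arriving from address $1.
block_udp_from() {
    $IPTABLES -I INPUT -p udp -s "$1" -j DROP
}

# Delete the rule again (-D takes the same rule specification as -I).
unblock_udp_from() {
    $IPTABLES -D INPUT -p udp -s "$1" -j DROP
}

# List the INPUT chain with packet counters, to check that the rule matches.
show_input_rules() {
    $IPTABLES -L INPUT -n -v
}
```

Run as root on SG-azfw2-190, `block_udp_from 172.19.0.13` would start the test and `unblock_udp_from 172.19.0.13` would end it.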
I don't think Corosync is detecting the partition correctly, as I am getting
different membership information from different nodes.
On node SG-azfw2-189, I still see the members as:
Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]
Full list of resources:
Master/Slave Set: ms_mysql [p_mysql]
Masters: [ SG-azfw2-189 ]
Slaves: [ SG-azfw2-190 SG-azfw2-191 ]
whereas on node SG-azfw2-190, I see the membership as:
Online: [ SG-azfw2-190 SG-azfw2-191 ]
OFFLINE: [ SG-azfw2-189 ]
Full list of resources:
Master/Slave Set: ms_mysql [p_mysql]
Slaves: [ SG-azfw2-190 SG-azfw2-191 ]
Stopped: [ SG-azfw2-189 ]
I expected node SG-azfw2-189 to detect that the other 2 nodes had left.
In the Corosync logs for this node, I continuously see the messages
below:
Apr 30 11:00:03 corosync [TOTEM ] entering GATHER state from 4.
Apr 30 11:00:03 corosync [TOTEM ] Creating commit token because I am the rep.
Apr 30 11:00:03 corosync [MAIN ] Storing new sequence id for ring 2e64
Apr 30 11:00:03 corosync [TOTEM ] entering COMMIT state.
Apr 30 11:00:33 corosync [TOTEM ] The token was lost in the COMMIT state.
Apr 30 11:00:33 corosync [TOTEM ] entering GATHER state from 4.
Apr 30 11:00:33 corosync [TOTEM ] Creating commit token because I am the rep.
Apr 30 11:00:33 corosync [MAIN ] Storing new sequence id for ring 2e68
Apr 30 11:00:33 corosync [TOTEM ] entering COMMIT state.
Apr 30 11:01:03 corosync [TOTEM ] The token was lost in the COMMIT state.
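To quantify the loop in the messages above, a minimal sketch that counts how often the token was lost in the COMMIT state and lists the ring ids that were stored. The function names are hypothetical, and the log file path is passed as an argument because its location depends on the logging configuration.

```shell
# Count the log entries that report a token loss in the COMMIT state.
count_commit_token_loss() {
    grep -c 'The token was lost in the COMMIT state' "$1"
}

# Print the ring ids from "Storing new sequence id for ring ..." entries,
# one per line, in the order they appear in the log.
list_ring_ids() {
    sed -n 's/.*Storing new sequence id for ring \([0-9a-f]*\).*/\1/p' "$1"
}
```

For example, `count_commit_token_loss /var/log/cluster/corosync.log` (the path is an assumption) gives a number that keeps growing for as long as the node cycles between GATHER and COMMIT.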
On the other nodes, I see messages like:
notice: pcmk_peer_update: Transitional membership event on ring 11888: memb=2, new=0, lost=0
Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: memb: SG-azfw2-190 301994924
Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: memb: SG-azfw2-191 603984812
Apr 30 11:06:10 corosync [TOTEM ] waiting_trans_ack changed to 1
Apr 30 11:06:10 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 11888: memb=2, new=0, lost=0
Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: MEMB: SG-azfw2-190 301994924
Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: MEMB: SG-azfw2-191 603984812
Apr 30 11:06:10 corosync [SYNC ] This node is within the primary component and will provide service.
Apr 30 11:06:10 corosync [TOTEM ] entering OPERATIONAL state.
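To compare what each node believes the membership to be, a minimal sketch that extracts the node names from the stable membership (MEMB:) entries of a log file. The function name is hypothetical, and the sketch assumes the log file has each entry on a single line, unlike the wrapped excerpt in this mail.

```shell
# Print the distinct node names listed in stable membership events.
# In a MEMB: entry the node name is the second-to-last field,
# followed by the numeric node id.
stable_members() {
    awk '/pcmk_peer_update: MEMB:/ { print $(NF-1) }' "$1" | sort -u
}
```

Running this against the log on each node and comparing the results would show which nodes disagree about the membership.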
Could the Corosync experts please guide me on the probable root cause of
this, or on ways to debug it further? Any help is much appreciated.
corosync version: 1.4.8.
pacemaker version: 1.1.14-8.el6_8.1
Thanks!