[ClusterLabs] Corosync unable to reach consensus for membership

Tue Apr 30 07:56:28 EDT 2019

Prasad,

> Hello :
> 
> I have a 3 node corosync and pacemaker cluster and the nodes are:
> Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]
> 
> Full list of resources:
> 
>   Master/Slave Set: ms_mysql [p_mysql]
>       Masters: [ SG-azfw2-189 ]
>       Slaves: [ SG-azfw2-190 SG-azfw2-191 ]
> 
> For my network partition test, I created a firewall rule on node
> SG-azfw2-190 to block all incoming UDP traffic from node SG-azfw2-189:
> /sbin/iptables -I INPUT -p udp -s 172.19.0.13 -j DROP

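Please block both input and output. With only incoming traffic dropped,
SG-azfw2-190 stops hearing SG-azfw2-189 while SG-azfw2-189 still receives
packets from SG-azfw2-190. Corosync isn't able to handle that kind of
asymmetric (Byzantine) fault.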
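
A minimal sketch of a symmetric block, assuming the rules go on
SG-azfw2-190 and the peer address is the 172.19.0.13 from your rule. The
helper name block_udp_both_ways and the RUN variable are only
illustrative, not part of corosync or iptables. By default it just prints
the commands; run it as root with RUN= (empty) to actually apply them.

```shell
#!/bin/sh
# Illustrative helper (hypothetical name): drop UDP to and from one peer.
# RUN defaults to "echo" so the commands are only printed (dry run).
# Set RUN= (empty) and run as root to really insert the rules.
RUN="${RUN-echo}"
PEER="${PEER:-172.19.0.13}"   # address taken from the original test

block_udp_both_ways() {
    peer="$1"
    # Incoming: packets whose source address is the peer.
    $RUN /sbin/iptables -I INPUT -p udp -s "$peer" -j DROP
    # Outgoing: packets whose destination address is the peer.
    $RUN /sbin/iptables -I OUTPUT -p udp -d "$peer" -j DROP
}

block_udp_both_ways "$PEER"
```

To undo the test, the same two rules can be deleted by replacing -I with
-D.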

Honza

> 
> I don't think corosync is correctly detecting the partition, as I am getting
> different membership information from different nodes.
> On node SG-azfw2-189, I still see the members as:
> 
> Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]
> 
> Full list of resources:
> 
>   Master/Slave Set: ms_mysql [p_mysql]
>       Masters: [ SG-azfw2-189 ]
>       Slaves: [ SG-azfw2-190 SG-azfw2-191 ]
> 
> whereas on node SG-azfw2-190, I see the membership as:
> 
> Online: [ SG-azfw2-190 SG-azfw2-191 ]
> OFFLINE: [ SG-azfw2-189 ]
> 
> Full list of resources:
> 
>   Master/Slave Set: ms_mysql [p_mysql]
>       Slaves: [ SG-azfw2-190 SG-azfw2-191 ]
>       Stopped: [ SG-azfw2-189 ]
> 
> I expected node SG-azfw2-189 to detect that the other 2 nodes have left.
> In the corosync logs for this node, I continuously see the messages
> below:
> Apr 30 11:00:03 corosync [TOTEM ] entering GATHER state from 4.
> Apr 30 11:00:03 corosync [TOTEM ] Creating commit token because I am the rep.
> Apr 30 11:00:03 corosync [MAIN  ] Storing new sequence id for ring 2e64
> Apr 30 11:00:03 corosync [TOTEM ] entering COMMIT state.
> Apr 30 11:00:33 corosync [TOTEM ] The token was lost in the COMMIT state.
> Apr 30 11:00:33 corosync [TOTEM ] entering GATHER state from 4.
> Apr 30 11:00:33 corosync [TOTEM ] Creating commit token because I am the rep.
> Apr 30 11:00:33 corosync [MAIN  ] Storing new sequence id for ring 2e68
> Apr 30 11:00:33 corosync [TOTEM ] entering COMMIT state.
> Apr 30 11:01:03 corosync [TOTEM ] The token was lost in the COMMIT state.
> 
> On the other nodes, I see messages like:
>   notice: pcmk_peer_update: Transitional membership event on ring 11888: memb=2, new=0, lost=0
> Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: memb: SG-azfw2-190 301994924
> Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: memb: SG-azfw2-191 603984812
> Apr 30 11:06:10 corosync [TOTEM ] waiting_trans_ack changed to 1
> Apr 30 11:06:10 corosync [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 11888: memb=2, new=0, lost=0
> Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: MEMB: SG-azfw2-190 301994924
> Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: MEMB: SG-azfw2-191 603984812
> Apr 30 11:06:10 corosync [SYNC  ] This node is within the primary component and will provide service.
> Apr 30 11:06:10 corosync [TOTEM ] entering OPERATIONAL state.
> 
> Can the corosync experts please guide me on the probable root cause of this,
> or on ways to debug it further? Help is much appreciated.
> 
> corosync version: 1.4.8.
> pacemaker version:  1.1.14-8.el6_8.1
> 
> Thanks!