<div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr">Hello,<div><br></div><div>I have a 3-node corosync and pacemaker cluster. The nodes are:</div><div><div>Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]</div><div><br></div><div>Full list of resources:</div><div><br></div><div> Master/Slave Set: ms_mysql [p_mysql]</div><div> Masters: [ SG-azfw2-189 ]</div><div> Slaves: [ SG-azfw2-190 SG-azfw2-191 ]</div></div><div><br></div><div>For my network partition test, I created a firewall rule on node
SG-azfw2-190 to block all incoming UDP traffic from node SG-azfw2-189:</div><div>/sbin/iptables -I INPUT -p udp -s 172.19.0.13 -j DROP<br></div><div><br></div><div>I don't think corosync is detecting the partition correctly, because I am getting different membership information from different nodes.</div><div>On node SG-azfw2-189, I still see the members as:</div><div><br></div><div><div>Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]</div><div><br></div><div>Full list of resources:</div><div><br></div><div> Master/Slave Set: ms_mysql [p_mysql]</div><div> Masters: [ SG-azfw2-189 ]</div><div> Slaves: [ SG-azfw2-190 SG-azfw2-191 ]</div></div><div><br></div><div>whereas on node SG-azfw2-190, I see the membership as:</div><div><br></div><div><div>Online: [ SG-azfw2-190 SG-azfw2-191 ]</div><div>OFFLINE: [ SG-azfw2-189 ]</div><div><br></div><div>Full list of resources:</div><div><br></div><div> Master/Slave Set: ms_mysql [p_mysql]</div><div> Slaves: [ SG-azfw2-190 SG-azfw2-191 ]</div><div> Stopped: [ SG-azfw2-189 ]</div></div><div><br></div><div>I expected node SG-azfw2-189 to detect that the other 2 nodes had left. 
In the corosync logs for this node, I continuously see the messages below:</div><div><div>Apr 30 11:00:03 corosync [TOTEM ] entering GATHER state from 4.</div><div>Apr 30 11:00:03 corosync [TOTEM ] Creating commit token because I am the rep.</div><div>Apr 30 11:00:03 corosync [MAIN ] Storing new sequence id for ring 2e64</div><div>Apr 30 11:00:03 corosync [TOTEM ] entering COMMIT state.</div><div>Apr 30 11:00:33 corosync [TOTEM ] The token was lost in the COMMIT state.</div><div>Apr 30 11:00:33 corosync [TOTEM ] entering GATHER state from 4.</div><div>Apr 30 11:00:33 corosync [TOTEM ] Creating commit token because I am the rep.</div><div>Apr 30 11:00:33 corosync [MAIN ] Storing new sequence id for ring 2e68</div><div>Apr 30 11:00:33 corosync [TOTEM ] entering COMMIT state.</div><div>Apr 30 11:01:03 corosync [TOTEM ] The token was lost in the COMMIT state.</div></div><div><br></div><div>On the other nodes, I see messages like:</div><div><div> notice: pcmk_peer_update: Transitional membership event on ring 11888: memb=2, new=0, lost=0</div><div>Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: memb: SG-azfw2-190 301994924</div><div>Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: memb: SG-azfw2-191 603984812</div><div>Apr 30 11:06:10 corosync [TOTEM ] waiting_trans_ack changed to 1</div><div>Apr 30 11:06:10 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 11888: memb=2, new=0, lost=0</div><div>Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: MEMB: SG-azfw2-190 301994924</div><div>Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: MEMB: SG-azfw2-191 603984812</div><div>Apr 30 11:06:10 corosync [SYNC ] This node is within the primary component and will provide service.</div><div>Apr 30 11:06:10 corosync [TOTEM ] entering OPERATIONAL state.</div></div><div><br></div><div>Can the corosync experts please guide me on the probable root cause of this, or on ways to debug it further? 
Any help is much appreciated.</div><div><br></div><div>corosync version: 1.4.8</div><div><div>pacemaker version: 1.1.14-8.el6_8.1</div></div><div><br></div><div>Thanks!</div><div><br></div><div><br></div></div></div></div></div></div></div></div></div></div>