<div dir="ltr"><div>Hello Jan,</div><div><br></div><div dir="ltr">>Please block both input and output. Corosync isn't able to handle <br>>byzantine faults. </div><div dir="ltr"><br></div><div>Thanks. Blocking both outgoing and incoming UDP traffic to and from a given node results in a clean partition.</div><div><br></div><div>However, could you suggest the best way to handle real-world production scenarios that result in only one-way traffic loss?</div><div><br></div><div>Thanks again.</div><div>Prasad</div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Apr 30, 2019 at 5:26 PM Jan Friesse &lt;<a href="mailto:jfriesse@redhat.com">jfriesse@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Prasad,<br>

<br>

> Hello :<br>

> <br>

> I have a 3 node corosync and pacemaker cluster and the nodes are:<br>

> Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]<br>

> <br>

> Full list of resources:<br>

> <br>

>   Master/Slave Set: ms_mysql [p_mysql]<br>

>       Masters: [ SG-azfw2-189 ]<br>

>       Slaves: [ SG-azfw2-190 SG-azfw2-191 ]<br>

> <br>

> For my network partition test, I created a firewall rule on node<br>

> SG-azfw2-190 to block all incoming UDP traffic from node SG-azfw2-189:<br>

> /sbin/iptables -I  INPUT -p udp -s 172.19.0.13 -j DROP<br>

<br>

Please block both input and output. Corosync isn't able to handle <br>

byzantine faults.<br>

<br>

Honza<br>

<br>

> <br>

> I don't think corosync is correctly detecting the partition as I am getting<br>

> different membership information from different nodes.<br>

> On node  SG-azfw2-189, I still see the members as:<br>

> <br>

> Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]<br>

> <br>

> Full list of resources:<br>

> <br>

>   Master/Slave Set: ms_mysql [p_mysql]<br>

>       Masters: [ SG-azfw2-189 ]<br>

>       Slaves: [ SG-azfw2-190 SG-azfw2-191 ]<br>

> <br>

> whereas, on the node SG-azfw2-190, I see membership as<br>

> <br>

> Online: [ SG-azfw2-190 SG-azfw2-191 ]<br>

> OFFLINE: [ SG-azfw2-189 ]<br>

> <br>

> Full list of resources:<br>

> <br>

>   Master/Slave Set: ms_mysql [p_mysql]<br>

>       Slaves: [ SG-azfw2-190 SG-azfw2-191 ]<br>

>       Stopped: [ SG-azfw2-189 ]<br>

> <br>

> I expected node SG-azfw2-189 to detect that the other 2<br>

> nodes have left. In the corosync logs for this node, I continuously see the<br>

> messages below:<br>

> Apr 30 11:00:03 corosync [TOTEM ] entering GATHER state from 4.<br>

> Apr 30 11:00:03 corosync [TOTEM ] Creating commit token because I am the<br>

> rep.<br>

> Apr 30 11:00:03 corosync [MAIN  ] Storing new sequence id for ring 2e64<br>

> Apr 30 11:00:03 corosync [TOTEM ] entering COMMIT state.<br>

> Apr 30 11:00:33 corosync [TOTEM ] The token was lost in the COMMIT state.<br>

> Apr 30 11:00:33 corosync [TOTEM ] entering GATHER state from 4.<br>

> Apr 30 11:00:33 corosync [TOTEM ] Creating commit token because I am the<br>

> rep.<br>

> Apr 30 11:00:33 corosync [MAIN  ] Storing new sequence id for ring 2e68<br>

> Apr 30 11:00:33 corosync [TOTEM ] entering COMMIT state.<br>

> Apr 30 11:01:03 corosync [TOTEM ] The token was lost in the COMMIT state.<br>

> <br>

> On the other nodes - I see messages like<br>

>   notice: pcmk_peer_update: Transitional membership event on ring 11888:<br>

> memb=2, new=0, lost=0<br>

> Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: memb:<br>

> SG-azfw2-190 301994924<br>

> Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: memb:<br>

> SG-azfw2-191 603984812<br>

> Apr 30 11:06:10 corosync [TOTEM ] waiting_trans_ack changed to 1<br>

> Apr 30 11:06:10 corosync [pcmk  ] notice: pcmk_peer_update: Stable<br>

> membership event on ring 11888: memb=2, new=0, lost=0<br>

> Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: MEMB:<br>

> SG-azfw2-190 301994924<br>

> Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: MEMB:<br>

> SG-azfw2-191 603984812<br>

> Apr 30 11:06:10 corosync [SYNC  ] This node is within the primary component<br>

> and will provide service.<br>

> Apr 30 11:06:10 corosync [TOTEM ] entering OPERATIONAL state.<br>

> <br>

> Can the corosync experts please guide me on the probable root cause of this or<br>

> ways to debug it further? Help much appreciated.<br>

> <br>

> corosync version: 1.4.8.<br>

> pacemaker version:  1.1.14-8.el6_8.1<br>

> <br>

> Thanks!<br>


<br>

</blockquote></div></div>