[ClusterLabs] Corosync unable to reach consensus for membership

Prasad Nagaraj prasad.nagaraj76 at gmail.com
Wed May 1 02:33:53 EDT 2019


Hello Jan,

>Please block both input and output. Corosync isn't able to handle
>Byzantine faults.

Thanks. Blocking both outgoing and incoming UDP traffic to and from a given
node does result in a clean partition.
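For the record, this is roughly the symmetric block I applied (a sketch: the
peer address 172.19.0.13 is from my earlier test, and the DRY_RUN toggle is
something I added for safety, not part of the original command):

```shell
#!/bin/sh
# Symmetric partition sketch: drop corosync (UDP) traffic in BOTH directions.
# Blocking only INPUT leaves an asymmetric (one-way) fault, which corosync
# cannot handle; blocking INPUT and OUTPUT gives a clean partition.
# Assumptions: peer address 172.19.0.13 (from the earlier test). DRY_RUN is a
# hypothetical safety toggle: by default the commands are only printed for
# review; run with DRY_RUN=0 (as root) to actually apply them.
PEER=172.19.0.13
: "${DRY_RUN:=1}"
run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }
run iptables -I INPUT  -p udp -s "$PEER" -j DROP   # drop packets FROM the peer
run iptables -I OUTPUT -p udp -d "$PEER" -j DROP   # drop packets TO the peer
```

To heal the partition afterwards, the same two rules can be removed with
`iptables -D` instead of `-I`.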

However, could you suggest the best way to handle real-world production
scenarios that may result in one-way traffic loss only?

Thanks again.
Prasad
On Tue, Apr 30, 2019 at 5:26 PM Jan Friesse <jfriesse at redhat.com> wrote:

> Prasad,
>
> > Hello :
> >
> > I have a 3 node corosync and pacemaker cluster and the nodes are:
> > Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]
> >
> > Full list of resources:
> >
> >   Master/Slave Set: ms_mysql [p_mysql]
> >       Masters: [ SG-azfw2-189 ]
> >       Slaves: [ SG-azfw2-190 SG-azfw2-191 ]
> >
> > For my network partition test, I created a firewall rule on node
> > SG-azfw2-190 to block all incoming UDP traffic from node SG-azfw2-189:
> > /sbin/iptables -I  INPUT -p udp -s 172.19.0.13 -j DROP
>
> Please block both input and output. Corosync isn't able to handle
> Byzantine faults.
>
> Honza
>
> >
> > I don't think corosync is correctly detecting the partition, as I am
> > getting different membership information from different nodes.
> > On node  SG-azfw2-189, I still see the members as:
> >
> > Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]
> >
> > Full list of resources:
> >
> >   Master/Slave Set: ms_mysql [p_mysql]
> >       Masters: [ SG-azfw2-189 ]
> >       Slaves: [ SG-azfw2-190 SG-azfw2-191 ]
> >
> > whereas, on the node SG-azfw2-190, I see membership as
> >
> > Online: [ SG-azfw2-190 SG-azfw2-191 ]
> > OFFLINE: [ SG-azfw2-189 ]
> >
> > Full list of resources:
> >
> >   Master/Slave Set: ms_mysql [p_mysql]
> >       Slaves: [ SG-azfw2-190 SG-azfw2-191 ]
> >       Stopped: [ SG-azfw2-189 ]
> >
> > I expected that node SG-azfw2-189 should have detected that the other
> > two nodes have left. In the corosync logs for this node, I continuously
> > see the messages below:
> > Apr 30 11:00:03 corosync [TOTEM ] entering GATHER state from 4.
> > Apr 30 11:00:03 corosync [TOTEM ] Creating commit token because I am the
> > rep.
> > Apr 30 11:00:03 corosync [MAIN  ] Storing new sequence id for ring 2e64
> > Apr 30 11:00:03 corosync [TOTEM ] entering COMMIT state.
> > Apr 30 11:00:33 corosync [TOTEM ] The token was lost in the COMMIT state.
> > Apr 30 11:00:33 corosync [TOTEM ] entering GATHER state from 4.
> > Apr 30 11:00:33 corosync [TOTEM ] Creating commit token because I am the
> > rep.
> > Apr 30 11:00:33 corosync [MAIN  ] Storing new sequence id for ring 2e68
> > Apr 30 11:00:33 corosync [TOTEM ] entering COMMIT state.
> > Apr 30 11:01:03 corosync [TOTEM ] The token was lost in the COMMIT state.
> >
> > On the other nodes, I see messages like:
> >   notice: pcmk_peer_update: Transitional membership event on ring 11888:
> > memb=2, new=0, lost=0
> > Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: memb:
> > SG-azfw2-190 301994924
> > Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: memb:
> > SG-azfw2-191 603984812
> > Apr 30 11:06:10 corosync [TOTEM ] waiting_trans_ack changed to 1
> > Apr 30 11:06:10 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> > membership event on ring 11888: memb=2, new=0, lost=0
> > Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > SG-azfw2-190 301994924
> > Apr 30 11:06:10 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > SG-azfw2-191 603984812
> > Apr 30 11:06:10 corosync [SYNC  ] This node is within the primary
> > component and will provide service.
> > Apr 30 11:06:10 corosync [TOTEM ] entering OPERATIONAL state.
> >
> > Can the corosync experts please guide me on the probable root cause for
> > this, or ways to debug it further? Help much appreciated.
> >
> > corosync version: 1.4.8.
> > pacemaker version:  1.1.14-8.el6_8.1
> >
> > Thanks!
> >
> >
> >
> > _______________________________________________
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
> >
>
>