[ClusterLabs] Corosync unable to reach consensus for membership
Jan Friesse
jfriesse at redhat.com
Thu May 2 02:37:23 EDT 2019
Prasad,
> Hello Jan,
>
> Please block both input and output. Corosync isn't able to handle
> Byzantine faults.
>
> Thanks. It results in a clean partition if I block both outgoing and
> incoming UDP traffic to and from a given node.
Good
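
For reference, a symmetric block (reusing the peer address 172.19.0.13 from the original test) might look roughly like this:

```shell
# Drop corosync UDP traffic in both directions to simulate a clean
# partition. 172.19.0.13 is the peer node's address from the test above.
/sbin/iptables -I INPUT  -p udp -s 172.19.0.13 -j DROP
/sbin/iptables -I OUTPUT -p udp -d 172.19.0.13 -j DROP
```

Blocking only INPUT leaves a one-way link, which totem's membership protocol (built on the assumption of symmetric connectivity) cannot resolve into a stable partition.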
>
> However, could you suggest the best way to handle real-world production
> scenarios that may result in one-way traffic loss?
There is not much corosync itself can do. But stonith with power
fencing handles this situation quite well, because the quorate partition
will kill the node.
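
As a sketch of this suggestion (the fence agent, device address, and credentials below are hypothetical and depend on your hardware), a power-fencing stonith resource could be configured with pcs roughly like:

```shell
# Hypothetical IPMI-based power fencing for one node; repeat per node
# with that node's management-interface details.
pcs stonith create fence-189 fence_ipmilan \
    pcmk_host_list="SG-azfw2-189" \
    ipaddr="10.0.0.189" login="admin" passwd="secret"

# Ensure fencing is actually enabled cluster-wide.
pcs property set stonith-enabled=true
```

With fencing in place, when the link becomes one-way the quorate partition fences the isolated node, so the cluster never ends up with two conflicting membership views for long.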
Regards,
Honza
>
> Thanks again.
> Prasad
> On Tue, Apr 30, 2019 at 5:26 PM Jan Friesse <jfriesse at redhat.com> wrote:
>
>> Prasad,
>>
>>> Hello :
>>>
>>> I have a 3 node corosync and pacemaker cluster and the nodes are:
>>> Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]
>>>
>>> Full list of resources:
>>>
>>> Master/Slave Set: ms_mysql [p_mysql]
>>> Masters: [ SG-azfw2-189 ]
>>> Slaves: [ SG-azfw2-190 SG-azfw2-191 ]
>>>
>>> For my network partition test, I created a firewall rule on node
>>> SG-azfw2-190 to block all incoming UDP traffic from node SG-azfw2-189:
>>> /sbin/iptables -I INPUT -p udp -s 172.19.0.13 -j DROP
>>
>> Please block both input and output. Corosync isn't able to handle
>> Byzantine faults.
>>
>> Honza
>>
>>>
>>> I don't think corosync is correctly detecting the partition, as I am
>>> getting different membership information from different nodes.
>>> On node SG-azfw2-189, I still see the members as:
>>>
>>> Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]
>>>
>>> Full list of resources:
>>>
>>> Master/Slave Set: ms_mysql [p_mysql]
>>> Masters: [ SG-azfw2-189 ]
>>> Slaves: [ SG-azfw2-190 SG-azfw2-191 ]
>>>
>>> whereas, on the node SG-azfw2-190, I see membership as
>>>
>>> Online: [ SG-azfw2-190 SG-azfw2-191 ]
>>> OFFLINE: [ SG-azfw2-189 ]
>>>
>>> Full list of resources:
>>>
>>> Master/Slave Set: ms_mysql [p_mysql]
>>> Slaves: [ SG-azfw2-190 SG-azfw2-191 ]
>>> Stopped: [ SG-azfw2-189 ]
>>>
>>> I expected that node SG-azfw2-189 should have detected that the other
>>> two nodes had left. In the corosync logs for this node, I continuously
>>> see the messages below:
>>> Apr 30 11:00:03 corosync [TOTEM ] entering GATHER state from 4.
>>> Apr 30 11:00:03 corosync [TOTEM ] Creating commit token because I am the
>>> rep.
>>> Apr 30 11:00:03 corosync [MAIN ] Storing new sequence id for ring 2e64
>>> Apr 30 11:00:03 corosync [TOTEM ] entering COMMIT state.
>>> Apr 30 11:00:33 corosync [TOTEM ] The token was lost in the COMMIT state.
>>> Apr 30 11:00:33 corosync [TOTEM ] entering GATHER state from 4.
>>> Apr 30 11:00:33 corosync [TOTEM ] Creating commit token because I am the
>>> rep.
>>> Apr 30 11:00:33 corosync [MAIN ] Storing new sequence id for ring 2e68
>>> Apr 30 11:00:33 corosync [TOTEM ] entering COMMIT state.
>>> Apr 30 11:01:03 corosync [TOTEM ] The token was lost in the COMMIT state.
>>>
>>> On the other nodes - I see messages like
>>> notice: pcmk_peer_update: Transitional membership event on ring 11888:
>>> memb=2, new=0, lost=0
>>> Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: memb:
>>> SG-azfw2-190 301994924
>>> Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: memb:
>>> SG-azfw2-191 603984812
>>> Apr 30 11:06:10 corosync [TOTEM ] waiting_trans_ack changed to 1
>>> Apr 30 11:06:10 corosync [pcmk ] notice: pcmk_peer_update: Stable
>>> membership event on ring 11888: memb=2, new=0, lost=0
>>> Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: MEMB:
>>> SG-azfw2-190 301994924
>>> Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: MEMB:
>>> SG-azfw2-191 603984812
>>> Apr 30 11:06:10 corosync [SYNC ] This node is within the primary
>>> component and will provide service.
>>> Apr 30 11:06:10 corosync [TOTEM ] entering OPERATIONAL state.
>>>
>>> Can the corosync experts please guide me on the probable root cause for
>>> this, or on ways to debug it further? Help much appreciated.
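
One low-level check that may help when the nodes disagree about membership (assuming corosync-cfgtool is shipped with this corosync 1.x build) is to compare the ring status as each node sees it:

```shell
# Print the local node ID and the status of each totem ring;
# run this on every node and compare the output.
corosync-cfgtool -s
```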
>>>
>>> corosync version: 1.4.8.
>>> pacemaker version: 1.1.14-8.el6_8.1
>>>
>>> Thanks!
>>>
>>>
>>>
>>
>>
>