[ClusterLabs] Two node cluster goes into split brain scenario during CPU intensive tasks

Mon Jun 24 10:57:47 EDT 2019

On Mon, 2019-06-24 at 08:52 +0200, Jan Friesse wrote:
> Somanath,
> 
> > Hi All,
> > 
> > I have a two-node cluster with multicast (udp) transport. The
> > multicast IP used is 224.1.1.1.
> 
> Would you mind giving UDPU (unicast) a try? For a two-node cluster
> there is going to be no difference in terms of speed/throughput.
> 
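
A minimal sketch of what the unicast suggestion might look like in
corosync.conf, assuming corosync 2.x. The cluster name and both node
names are placeholders, not taken from the original post. With udpu a
nodelist is required, the same file has to be present on both nodes,
and corosync needs a restart for a transport change to take effect.

```
totem {
    version: 2
    # placeholder cluster name
    cluster_name: mycluster
    # unicast UDP instead of multicast
    transport: udpu
}

nodelist {
    node {
        # placeholder; use the real address or hostname of the first node
        ring0_addr: node1.example.com
        nodeid: 1
    }
    node {
        # placeholder; use the real address or hostname of the second node
        ring0_addr: node2.example.com
        nodeid: 2
    }
}
```
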
> > 
> > Whenever there is a CPU-intensive task, the pcs cluster goes into a
> > split-brain scenario and doesn't recover automatically. We have to

In addition to others' comments: if fencing is enabled, split brain
should not be possible. Automatic recovery should work as long as
fencing succeeds. With fencing disabled, split brain with no automatic
recovery can definitely happen.

> > do a manual restart of services to bring both nodes online again.
> > 
> > Before the nodes go into split brain, the corosync log shows:
> > 
> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
> 
> This usually happens when:
> - multicast is somehow rate-limited on the switch side
>   (configuration/bad switch implementation/...)
> - the MTU of the network is smaller than 1500 bytes and fragmentation
>   is not allowed -> try reducing totem.netmtu
> 
> Regards,
>    Honza
> 
> 
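
For the MTU case, a sketch of lowering totem.netmtu in corosync.conf.
The value 1400 is only an illustration, not a recommendation from this
thread; choose something that fits the smallest MTU on the path between
the nodes. The default is 1500.

```
totem {
    # existing totem options stay as they are
    # illustrative value only; the default is 1500
    netmtu: 1400
}
```
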
> > May 24 15:51:42 server1 corosync[4745]:  [TOTEM ] A processor failed, forming new configuration.
> > May 24 16:41:42 server1 corosync[4745]:  [TOTEM ] A new membership (10.241.31.12:29276) was formed. Members left: 1
> > May 24 16:41:42 server1 corosync[4745]:  [TOTEM ] Failed to receive the leave message. failed: 1
> > 
> > Is there any way we can overcome this, or could this be due to
> > multicast issues on the network side?
> > 
> > With Regards
> > Somanath Thilak J
> > 
> > _______________________________________________
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> > 
> > ClusterLabs home: https://www.clusterlabs.org/
> > 
-- 
Ken Gaillot <kgaillot at redhat.com>