[ClusterLabs] Two node cluster goes into split brain scenario during CPU intensive tasks

Ken Gaillot kgaillot at redhat.com
Tue Jun 25 13:20:07 EDT 2019


On Tue, 2019-06-25 at 11:06 +0000, Somanath Jeeva wrote:
> I have not configured fencing in our setup. However, I would like to
> know if split brain can be avoided when CPU load is high. 

Fencing *is* the way to prevent split brain. If the nodes can't see
each other, one will power down the other, and be able to continue on.

Of course that doesn't address the root cause of the split, but it's
the only way the cluster can recover from a split.

Addressing the root cause, I'd first make sure corosync is running at
real-time priority (I forget the ps option, hopefully someone else can
chime in). Another possibility would be to raise the corosync token
timeout to allow for a greater time before a split is declared.
Finally, if the work causing the load is scheduled, you can put the
cluster in maintenance mode during the same time frame, so the
cluster will refrain from reacting to events until the end of that window.
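A sketch of those three checks/changes (command names assume the pcs shell and a corosync 2.x-style config file; verify the exact options against your installed versions):

```shell
# Check whether corosync runs at real-time priority:
# "CLS" should show RR (SCHED_RR) and "RTPRIO" a nonzero value.
ps -eo pid,cls,rtprio,comm | grep '[c]orosync'

# Raise the token timeout (milliseconds) in /etc/corosync/corosync.conf:
#   totem {
#       token: 10000
#   }
# then restart corosync on both nodes (corosync 3.x can reload with
# "corosync-cfgtool -R" instead).

# Put the cluster in maintenance mode before a scheduled high-load window:
pcs property set maintenance-mode=true
# ...and take it back out afterwards:
pcs property set maintenance-mode=false
```

Note that maintenance mode only stops the cluster from reacting; it does not prevent the corosync membership split itself, so the token timeout is the more direct knob for this symptom.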

> 
> With Regards
> Somanath Thilak J
> 
> -----Original Message-----
> From: Ken Gaillot <kgaillot at redhat.com> 
> Sent: Monday, June 24, 2019 20:28
> To: Cluster Labs - All topics related to open-source clustering
> welcomed <users at clusterlabs.org>; Somanath Jeeva <
> somanath.jeeva at ericsson.com>
> Subject: Re: [ClusterLabs] Two node cluster goes into split brain
> scenario during CPU intensive tasks
> 
> On Mon, 2019-06-24 at 08:52 +0200, Jan Friesse wrote:
> > Somanath,
> > 
> > > Hi All,
> > > 
> > > I have a two node cluster with multicast (udp) transport. The 
> > > multicast IP used is 224.1.1.1.
> > 
> > Would you mind giving UDPU (unicast) a try? For a two-node cluster 
> > there is going to be no difference in terms of speed/throughput.
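For reference, a minimal corosync 2.x sketch of a UDPU configuration (the first node address and node IDs are placeholders; only 10.241.31.12 appears in the logs below):

```
totem {
    version: 2
    transport: udpu
}

nodelist {
    node {
        ring0_addr: 10.241.31.11
        nodeid: 1
    }
    node {
        ring0_addr: 10.241.31.12
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}
```

With UDPU, membership traffic is plain unicast UDP between the listed nodes, which sidesteps any multicast rate-limiting or snooping issues on the switch.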
> > 
> > > 
> > > Whenever there is a CPU intensive task, the pcs cluster goes into 
> > > a split-brain scenario and doesn't recover automatically. We have
> > > to
> 
> In addition to others' comments: if fencing is enabled, split brain
> should not be possible. Automatic recovery should work as long as
> fencing succeeds. With fencing disabled, split brain with no
> automatic recovery can definitely happen.
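As an illustration only (the agent and every parameter value here are hypothetical; choose the fence agent that matches your hardware, e.g. IPMI, iLO, or a virtual-machine agent):

```shell
# Hypothetical IPMI-based fence devices, one per node.
pcs stonith create fence-node1 fence_ipmilan \
    ip=192.0.2.11 username=admin password=secret \
    pcmk_host_list=node1
pcs stonith create fence-node2 fence_ipmilan \
    ip=192.0.2.12 username=admin password=secret \
    pcmk_host_list=node2

# Make sure fencing is actually enabled cluster-wide.
pcs property set stonith-enabled=true
```
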
> 
> > > do a manual restart of services to bring both nodes online
> > > again. 
> > 
> > Before the nodes go into split brain, the corosync log shows:
> > > 
> > > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
> > > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
> > > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
> > > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
> > > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
> > 
> > This usually happens when:
> > - multicast is somehow rate-limited on the switch side 
> > (configuration/bad switch implementation/...)
> > - the network MTU is smaller than 1500 bytes and fragmentation is 
> > not allowed -> try reducing totem.netmtu
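For example, in corosync.conf (the value 1200 is only an example; match it to the actual path MTU of your network):

```
totem {
    netmtu: 1200
}
```
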
> > 
> > Regards,
> >    Honza
> > 
> > 
> > > May 24 15:51:42 server1 corosync[4745]:  [TOTEM ] A processor failed, forming new configuration.
> > > May 24 16:41:42 server1 corosync[4745]:  [TOTEM ] A new membership (10.241.31.12:29276) was formed. Members left: 1
> > > May 24 16:41:42 server1 corosync[4745]:  [TOTEM ] Failed to receive the leave message. failed: 1
> > > 
> > > Is there any way we can overcome this, or could this be due to 
> > > multicast issues on the network side?
> > > 
> > > With Regards
> > > Somanath Thilak J
> > > 
> > > _______________________________________________
> > > Manage your subscription:
> > > https://lists.clusterlabs.org/mailman/listinfo/users
> > > 
> > > ClusterLabs home: https://www.clusterlabs.org/
> > > 
> > 
> 
> --
> Ken Gaillot <kgaillot at redhat.com>
> 
-- 
Ken Gaillot <kgaillot at redhat.com>



More information about the Users mailing list