[ClusterLabs] Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks

Mon Jul 1 06:59:35 EDT 2019

>>> Ken Gaillot <kgaillot at redhat.com> wrote on 24.06.2019 at 16:57 in
message
<95f51b52283d05bbdcccc948e4508c406d7ccb64.camel at redhat.com>:
> On Mon, 2019-06-24 at 08:52 +0200, Jan Friesse wrote:
>> Somanath,
>> 
>> > Hi All,
>> > 
>> > I have a two-node cluster with multicast (udp) transport. The
>> > multicast IP used is 224.1.1.1.
>> 
>> Would you mind giving UDPU (unicast) a try? For a two-node cluster
>> there is going to be no difference in terms of speed/throughput.
>> 
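
For reference, a minimal sketch of what the unicast setup suggested here
might look like in corosync.conf, assuming corosync 2.x. The cluster name
and the first ring0_addr are placeholders, not values from this thread;
10.241.31.12 is the address that appears in the posted log. All other
sections of the existing file would stay as they are.

```
totem {
    version: 2
    cluster_name: mycluster        # placeholder name
    transport: udpu                # unicast UDP instead of multicast
}

nodelist {
    node {
        ring0_addr: 10.241.31.11   # placeholder for the first node's address
        nodeid: 1
    }
    node {
        ring0_addr: 10.241.31.12   # address seen in the posted log
        nodeid: 2
    }
}
```

With udpu the members are listed explicitly in nodelist, so the multicast
address is no longer used. The change has to be made on both nodes, and
corosync restarted, before it takes effect.
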
>> > 
>> > Whenever there is a CPU-intensive task, the pcs cluster goes into a
>> > split-brain scenario and doesn't recover automatically. We have to
> 
> In addition to others' comments: if fencing is enabled, split brain
> should not be possible. Automatic recovery should work as long as

---unless the fencing was caused by a persistent communication problem...

> fencing succeeds. With fencing disabled, split brain with no automatic
> recovery can definitely happen.
> 
>> > do a manual restart of services to bring both nodes online again. 
>> 
>> > Before the nodes go into split brain, the corosync log shows:
>> > 
>> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List:
>> > 7c 7e
>> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List:
>> > 7c 7e
>> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List:
>> > 7c 7e
>> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List:
>> > 7c 7e
>> > May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List:
>> > 7c 7e
>> 
>> This usually happens when:
>> - multicast is somehow rate-limited on the switch side
>>   (configuration/bad switch implementation/...)
>> - the MTU of the network is smaller than 1500 bytes and fragmentation is
>>   not allowed -> try reducing totem.netmtu
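
As a hedged illustration of the second point: netmtu is set in the totem
section of corosync.conf, and its default is 1500. The value 1400 below is
only an example chosen for this sketch, not a recommendation from this
thread; the right value depends on the smallest MTU on the path between the
nodes.

```
totem {
    version: 2
    netmtu: 1400    # example value; default is 1500
}
```

The setting should be identical on both nodes.
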
>> 
>> Regards,
>>    Honza
>> 
>> 
>> > May 24 15:51:42 server1 corosync[4745]:  [TOTEM ] A processor
>> > failed, forming new configuration.
>> > May 24 16:41:42 server1 corosync[4745]:  [TOTEM ] A new membership
>> > (10.241.31.12:29276) was formed. Members left: 1
>> > May 24 16:41:42 server1 corosync[4745]:  [TOTEM ] Failed to receive
>> > the leave message. failed: 1
>> > 
>> > Is there any way we can overcome this, or might this be due to
>> > multicast issues on the network side?
>> > 
>> > With Regards
>> > Somanath Thilak J
>> > 
>> > _______________________________________________
>> > Manage your subscription:
>> > https://lists.clusterlabs.org/mailman/listinfo/users 
>> > 
>> > ClusterLabs home: https://www.clusterlabs.org/ 
>> > 
> -- 
> Ken Gaillot <kgaillot at redhat.com>