[ClusterLabs] Antw: Re: Two node cluster goes into split brain scenario during CPU intensive tasks

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Jul 1 06:55:02 EDT 2019


>>> Jan Friesse <jfriesse at redhat.com> schrieb am 24.06.2019 um 08:52 in
Nachricht
<bb3cf1ca-1232-82f8-95f0-de57eadbe647 at redhat.com>:
> Somanath,
> 
>> Hi All,
>> 
>> I have a two-node cluster with multicast (UDP) transport. The multicast IP
>> used is 224.1.1.1.
> 
> Would you mind giving UDPU (unicast) a try? For a two-node cluster
> there is going to be no difference in terms of speed/throughput.
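For reference, a minimal corosync.conf sketch for UDPU in a two-node
cluster (the addresses are placeholders; the 10.241.31.x network is only
inferred from the membership line in the log below):

    totem {
        version: 2
        transport: udpu
        interface {
            ringnumber: 0
            bindnetaddr: 10.241.31.0
        }
    }
    nodelist {
        node {
            ring0_addr: 10.241.31.11
            nodeid: 1
        }
        node {
            ring0_addr: 10.241.31.12
            nodeid: 2
        }
    }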

I think a better recommendation would be to raise the timeouts of the corosync
protocol. I agree that the syslog message provides very little useful
information to the user: WHY does the retransmit list grow?
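For example, the token timeout could be raised in the totem section of
corosync.conf (5000 ms is only an illustrative value; the corosync 2.x
default is 1000 ms):

    totem {
        version: 2
        # Allow a busy node more time to pass the token before a
        # membership change is triggered (milliseconds).
        token: 5000
        # Retransmit the token more often before declaring it lost
        # (the default is 4).
        token_retransmits_before_loss_const: 10
    }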

> 
>> 
>> Whenever there is a CPU-intensive task, the pcs cluster goes into a
>> split-brain scenario and doesn't recover automatically. We have to do a
>> manual restart of services to bring both nodes online again.
>> Before the nodes go into split brain, the corosync log shows:
>> 
>> May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
>> May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
>> May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
>> May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
>> May 24 15:10:02 server1 corosync[4745]:  [TOTEM ] Retransmit List: 7c 7e
> 
> This usually happens when:
> - multicast is somehow rate-limited on the switch side (configuration, a
> bad switch implementation, ...)
> - the MTU of the network is smaller than 1500 bytes and fragmentation is
> not allowed -> try reducing totem.netmtu
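The netmtu suggestion would look like this in the totem section (1200 is
only an example value; it should match the actual path MTU):

    totem {
        version: 2
        # Keep totem datagrams small enough to avoid IP fragmentation on
        # a path whose MTU is below the 1500-byte default.
        netmtu: 1200
    }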
> 
> Regards,
>    Honza
> 
> 
>> May 24 15:51:42 server1 corosync[4745]:  [TOTEM ] A processor failed, forming new configuration.
>> May 24 16:41:42 server1 corosync[4745]:  [TOTEM ] A new membership (10.241.31.12:29276) was formed. Members left: 1
>> May 24 16:41:42 server1 corosync[4745]:  [TOTEM ] Failed to receive the leave message. failed: 1
>> 
>> Is there any way we can overcome this, or could this be due to multicast
>> issues on the network side?
>> 
>> With Regards
>> Somanath Thilak J
>> 
>> 
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> ClusterLabs home: https://www.clusterlabs.org/ 
>> 
> 




