[ClusterLabs] weird corosync - [TOTEM ] FAILED TO RECEIVE

Fri Nov 23 11:36:49 EST 2018

lejeczek,

> On 15/10/2018 07:24, Jan Friesse wrote:
>> lejeczek,
>>
>>> hi guys,
>>> I have a 3-node cluster (CentOS 7.5); 2 nodes seem fine but the third (or 
>>> probably something else in between) is not right.
>>> I see this:
>>>
>>>   $ pcs status --all
>>> Cluster name: CC
>>> Stack: corosync
>>> Current DC: whale.private (version 1.1.18-11.el7_5.3-2b07d5c5a9) - 
>>> partition with quorum
>>> Last updated: Fri Oct 12 15:40:39 2018
>>> Last change: Fri Oct 12 15:14:57 2018 by root via crm_resource on 
>>> whale.private
>>>
>>> 3 nodes configured
>>> 8 resources configured (1 DISABLED)
>>>
>>> Online: [ rental.private whale.private ]
>>> OFFLINE: [ rider.private ]
>>>
>>> and that third node logs:
>>>
>>> [TOTEM ] FAILED TO RECEIVE
>>>   [TOTEM ] A new membership (10.5.6.100:2504344) was formed. Members 
>>> left: 2 4
>>>   [TOTEM ] Failed to receive the leave message. failed: 2 4
>>>   [QUORUM] Members[1]: 1
>>>   [MAIN  ] Completed service synchronization, ready to provide service.
>>>   [TOTEM ] A new membership (10.5.6.49:2504348) was formed. Members 
>>> joined: 2 4
>>>   [TOTEM ] FAILED TO RECEIVE
>>>
>>> and it just keeps going like that.
>>> Sometimes a reboot (or stopping the services, waiting, then starting 
>>> them again) of that third node helps.
>>> But I get this situation almost every time a node gets (orderly) 
>>> shut down or rebooted.
>>> Network-wise, connectivity seems okay. Where to start?
>>>
>>
>> A little more information would be helpful (corosync version, 
>> protocol used - udpu/udp, corosync.conf, ...), but here are a few 
>> possible problems:
>> - If UDP (multicast) is used, try UDPU
>> - Check the firewall
>> - Try reducing the MTU used by corosync (option netmtu in corosync.conf)
>>
>> Regards,
>>   Honza
>>
> One thing I remember - could it be because, at the time of cluster 
> formation (and for some time after), one of the nodes had a different ruby 
> version from the other nodes?

Probably not, because corosync itself does not have any dependency on ruby.

> 
> I cannot remember when this problem started to appear; whether it was from 
> the beginning or later, I cannot say.
> 
> I'm on CentOS 7.6. I do not think I use UDP (other than the creation of some 
> resources and constraints it's a "vanilla" cluster). I use a 

That's why I've asked for config files ;)

> "non-default" MTU on the ifaces cluster uses, and also, those interfaces 
> are net-team devices. But still... why would it always be that one node (all 

So it's probably really the MTU; please try changing the netmtu option in 
corosync.conf.
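
For illustration only, a sketch of where that option lives. The netmtu
value of 1400 is an assumption picked for the example, not a
recommendation, and transport: udpu is likewise assumed; the remaining
keys are typical for a corosync 2.x cluster created by pcs and may differ
from your file:

```
# /etc/corosync/corosync.conf (fragment)
totem {
    version: 2
    cluster_name: CC
    # assumed; check which transport your file actually uses
    transport: udpu
    # assumed example value; choose something below the smallest MTU on the path
    netmtu: 1400
}
```

The file should be the same on all nodes, and as far as I know corosync
has to be restarted for a changed netmtu to take effect.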

> are virtually identical)

The devil is usually in the details, so "virtually identical" may mean 
not identical enough.
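
One way to check whether the nodes match where it matters here is to
compare the interface MTU on every node and to send full-size,
non-fragmenting pings between them. A rough sketch, assuming Linux
iputils ping (-M do forbids fragmentation) and IPv4; the helper names and
the interface name team0 are made up for the example:

```shell
#!/bin/sh
# Largest ICMP payload that fits in one IPv4 packet of the given MTU
# (20-byte IP header without options + 8-byte ICMP header).
icmp_payload() {
    echo $(( $1 - 28 ))
}

# Hypothetical helper: ping a peer ($1) with packets exactly as large as
# the MTU ($2), with fragmentation forbidden. It fails if something on
# the path only passes smaller packets.
check_path() {
    ping -M do -c 3 -s "$(icmp_payload "$2")" "$1"
}

# Example, to be run by hand on each node (team0 is an assumed name):
#   ip link show team0               # look at the "mtu" value
#   check_path rental.private 1500   # replace 1500 with that value
```

If the ping fails from one node only, that node (or its switch port) is
the place to look.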

> 
> many thanks, L.

No problem, but I'm not sure whether the hints were useful for you or not.

Regards,
   Honza

> 
> 
>>
>>> many thanks, L
>>
>