[ClusterLabs] Antw: Re: A processor failed, forming new configuration very often and without reason

Thu Apr 30 06:30:23 UTC 2015

Philippe,

Philippe Carbonnier wrote:
> Thanks for your answers.
> Token value was previously 5000, but I already increased it to 10000,
> without any change. So it takes 10 seconds before TOTEM fires the "A processor
> failed, forming new configuration" message, but in the log we see that in
> the same second the other node reappeared!

That's weird.

> Should I use a higher token value?

I don't think so. I mean, if both nodes are running on the same ESX host, it
shouldn't be needed.

- Is corosync scheduled regularly? If not, you would see the message
"Corosync main process was not scheduled for ... sec" in the logs.
- Is the firewall configured correctly?
- Is there some kind of rate limiting for packets?
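
The first check above can be done by searching the logs for the two messages quoted in this thread. The following is only a minimal sketch: the function names are made up for illustration, and it matches plain substrings because the exact log line format depends on the corosync version and logging setup.

```python
# Substrings quoted in this thread. Only substrings are matched, since the
# rest of each log line (timestamps, tags, units) differs between setups.
PATTERNS = {
    "not_scheduled": "Corosync main process was not scheduled for",
    "processor_failed": "A processor failed, forming new configuration",
}


def count_events(lines):
    """Count how many lines contain each of the substrings above."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, needle in PATTERNS.items():
            if needle in line:
                counts[name] += 1
    return counts


def scan_log(path):
    """Scan one log file; the path is supplied by the caller (site-specific)."""
    with open(path, errors="replace") as handle:
        return count_events(handle)
```

Calling scan_log() with the path of whatever file corosync logs to on your system (that location is an assumption to check locally) returns the two counts. A non-zero "not_scheduled" count would point at scheduling problems on the VM rather than at the network.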

Regards,
   Honza

>
> Best regards,
>
> 2015-04-29 14:17 GMT+02:00 Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>:
>
>>>>> Jan Friesse <jfriesse at redhat.com> wrote on 29.04.2015 at 13:10 in message
>> <5540BC0B.50409 at redhat.com>:
>>> Philippe,
>>>
>>> Philippe Carbonnier wrote:
>>>> Hello,
>>>> just for those who don't want to read all the logs, I put my question
>>>> on top (and at the end) of the post:
>>>> Is there a timer that I can raise to give the nodes more time to
>>>> see each other BEFORE TOTEM fires the "A processor failed, forming new
>>>> configuration" message? The 2 nodes really are up and running.
>>>
>>> There are many timers, but basically almost everything depends on the token
>>> timeout, so just set "token" to a higher value.
>>
>> Please correct me if I'm wrong: A token timeout is only triggered when
>> 1) The token is lost in the network (i.e. a packet is lost and not
>> retransmitted in time)
>> 2) The token is lost on a node (e.g. it crashes while it has the token)
>> 3) The host or the network doesn't respond in time (the token is not lost,
>> but late)
>> 4) There's a major bug in the TOTEM protocol (its implementation)
>>
>> I really wonder whether the reason for frequent token timeouts is 1);
>> usually it's not 2) either. For me, 3) is also hard to believe. And nobody
>> admits it's 4).
>>
>> So everybody says it's 3) and suggests to increase the timeout.
>>
>>>>
>>>> The 2 Linux servers (vif5_7 and host2.example.com) are 2 VMs on the same
>>>> VMware ESX server. Maybe the network is 'not working' the way corosync
>>>> wants?
>>
>> OK, for virtual hosts I might add:
>> 5) The virtual time is not flowing steadily, i.e. the number of usable CPU
>> cycles per unit of wall-clock time is highly variable.
>>
>>>
>>> Yep. But first give the token timeout increase a chance.
>>
>> I agree that for 5) a longer token timeout might be a workaround, but
>> finding the root cause may be worth the time spent doing so.
>>
>>
>> Regards,
>> Ulrich
>>
>>
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>