[ClusterLabs] Antw: Re: A processor failed, forming new configuration very often and without reason
jfriesse at redhat.com
Thu Apr 30 02:30:23 EDT 2015
Philippe Carbonnier wrote:
> Thanks for your answers.
> The token value was previously 5000, but I have already increased it to 10000,
> without any change. So TOTEM waits 10 seconds before firing the "A processor
> failed, forming new configuration" message, but in the log we see that the
> other node reappeared within the same second!
> Should I use a higher token value?
I don't think so. I mean, if both nodes are running on the same ESX host, it
shouldn't be needed.
- Is corosync scheduled regularly (if not, you would see the message "Corosync
main process was not scheduled for ... sec" in the logs)?
- Is the firewall correctly configured?
- Isn't there some kind of rate limiting for packets?
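To answer the first question quickly, a small Python sketch like the one below could count the relevant messages in a log. It is only an illustration: the two substrings are the ones quoted in this thread, the exact wording may differ between corosync versions, and the function name is made up for the example.

```python
# Substrings quoted in this thread. Only the leading part of each message is
# matched, because the rest of the wording may vary between corosync versions.
MESSAGES = {
    "not_scheduled": "Corosync main process was not scheduled for",
    "processor_failed": "A processor failed, forming new configuration",
}


def count_events(lines):
    """Count how many log lines contain each message of interest."""
    counts = {name: 0 for name in MESSAGES}
    for line in lines:
        for name, text in MESSAGES.items():
            if text in line:
                counts[name] += 1
    return counts
```

Passing an open log file works, since a file object iterates over its lines; use whatever path your corosync log lives at. If "not_scheduled" stays at zero while "processor_failed" keeps growing, scheduling is probably not the culprit and the network side deserves a closer look.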
> Best regards,
> 2015-04-29 14:17 GMT+02:00 Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>:
>>>>> Jan Friesse <jfriesse at redhat.com> wrote on 29.04.2015 at 13:10 in
>> <5540BC0B.50409 at redhat.com>:
>>> Philippe Carbonnier wrote:
>>>> Just for those who don't want to read all the logs, I put my question
>>>> at the top (and at the end) of the post:
>>>> Is there a timer that I can raise to give the nodes more time to see
>>>> each other BEFORE TOTEM fires the "A processor failed, forming new
>>>> configuration" message? The 2 nodes really are up and running.
>>> There are many timers, but basically almost everything depends on the
>>> token timeout, so just set "token" to a higher value.
>> Please correct me if I'm wrong: A token timeout is only triggered when
>> 1) The token is lost in the network (i.e. a packet is lost and not
>> retransmitted in time)
>> 2) The token is lost on a node (e.g. it crashes while it has the token)
>> 3) The host or the network doesn't respond in time (the token is not lost,
>> but late)
>> 4) There's a major bug in the TOTEM protocol (its implementation)
>> I really wonder whether the reason for frequent token timeouts is 1);
>> usually it's not 2) either. For me 3) is hard to believe also. And nobody
>> admits it's 4).
>> So everybody says it's 3) and suggests increasing the timeout.
>>>> The 2 Linux servers (vif5_7 and host2.example.com) are 2 VMs on the same
>>>> VMware ESX server. Maybe the network is not working the way corosync
>>>> wants?
>> OK, for virtual hosts I might add:
>> 5) The virtual time is not flowing steadily, i.e. the number of usable CPU
>> cycles per walltime unit is highly variable.
>>> Yep. But first give the token timeout increase a chance.
>> I agree that for 5) a longer token timeout might be a workaround, but
>> finding the root cause may be worth the time spent on it.
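One rough way to look for 5) from inside a guest is to sleep for short, fixed intervals and record how late each wake-up arrives. The Python sketch below is only an illustration, not a corosync tool: the function name and the default interval, sample count and threshold are arbitrary assumptions, and the guest's own clock may itself be affected by virtualization, so treat the result as a hint rather than proof.

```python
import time


def measure_stalls(interval=0.05, samples=200, threshold=0.5):
    """Sleep repeatedly and report wake-ups that arrive much later than asked.

    Returns a list of (sample_index, overrun_seconds) tuples, one for every
    sleep that overran the requested interval by more than `threshold` seconds.
    """
    stalls = []
    for i in range(samples):
        start = time.monotonic()
        time.sleep(interval)
        # Extra time beyond the requested sleep; large values suggest the
        # process (or the whole VM) was not scheduled for a while.
        overrun = time.monotonic() - start - interval
        if overrun > threshold:
            stalls.append((i, overrun))
    return stalls
```

Running it on both nodes for a longer period and comparing the timestamps of any stalls with the "A processor failed" messages would show whether the two line up. An empty result does not rule out 5), because a stall can fall between two measurement runs.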
>> Users mailing list: Users at clusterlabs.org
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org