[ClusterLabs] Antw: Re: A processor failed, forming new configuration very often and without reason

Wed Apr 29 08:17:49 EDT 2015

>>> Jan Friesse <jfriesse at redhat.com> wrote on 29.04.2015 at 13:10 in message
<5540BC0B.50409 at redhat.com>:
> Philippe,
> 
> Philippe Carbonnier wrote:
>> Hello,
>> just for those who don't want to read all the logs, I put my question
>> at the top (and at the end) of the post:
>> Is there a timer that I can raise to give the nodes more time to
>> see each other BEFORE TOTEM fires the "A processor failed, forming new
>> configuration" message? The 2 nodes really are up and running.
> 
> There are many timers, but basically almost everything depends on the token
> timeout, so just set "token" to a higher value.

Please correct me if I'm wrong: A token timeout is only triggered when
1) The token is lost in the network (i.e. a packet is lost and not retransmitted in time)
2) The token is lost on a node (e.g. it crashes while it has the token)
3) The host or the network doesn't respond in time (the token is not lost, but late)
4) There's a major bug in the TOTEM protocol (its implementation)

I really wonder whether the reason for frequent token timeouts is 1); usually it's not 2) either. For me 3) is also hard to believe. And nobody admits it's 4).

So everybody says it's 3) and suggests to increase the timeout.

>> 
>> The 2 Linux servers (vif5_7 and host2.example.com) are 2 VMs on the same
>> VMware ESX server. Maybe the network is 'not working' the way corosync
>> wants?

OK, for virtual hosts I might add:
5) The virtual time is not flowing steadily, i.e. the number of usable CPU cycles per walltime unit is highly variable.
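
One rough way to look for 5) from inside a guest is to measure how far a short sleep overshoots its requested duration. The following is a minimal sketch, not a diagnostic tool: the function names, the 10 ms interval and the sample count are my own assumptions, and it only shows scheduling latency as seen by an ordinary user process, not what corosync itself experiences.

```python
import time


def measure_jitter(interval_s=0.01, samples=200):
    """Sleep repeatedly and return each overshoot in seconds
    (measured sleep duration minus the requested interval)."""
    overshoots = []
    for _ in range(samples):
        start = time.monotonic()
        time.sleep(interval_s)
        elapsed = time.monotonic() - start
        overshoots.append(elapsed - interval_s)
    return overshoots


def summarize(overshoots):
    """Reduce a list of overshoots (seconds) to a few figures in milliseconds."""
    ordered = sorted(overshoots)
    return {
        "samples": len(ordered),
        "median_ms": ordered[len(ordered) // 2] * 1000.0,
        "max_ms": ordered[-1] * 1000.0,
    }


if __name__ == "__main__":
    # Hypothetical usage: run on each VM while the problem is occurring.
    print(summarize(measure_jitter()))
```

If the maximum overshoot occasionally gets close to the configured token timeout while the median stays small, that would point towards the guest being stalled rather than towards packet loss.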

> 
> Yep. But first give the token timeout increase a chance.

I agree that for 5) a longer token timeout might be a workaround, but finding the root cause may well be worth the time it takes.
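
For reference, the "token" value lives in the totem section of corosync.conf and is given in milliseconds. The fragment below is only a sketch: 5000 is an arbitrary example value and not a recommendation, and the rest of the existing totem section is assumed to stay unchanged.

```
totem {
    # ... existing settings unchanged ...
    # Token timeout in milliseconds (example value only)
    token: 5000
}
```

The same value has to be configured on all nodes, and corosync has to pick up the changed configuration before it takes effect.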

Regards,
Ulrich