[ClusterLabs] Corosync 3.1.0 token timeout

Jan Friesse jfriesse at redhat.com
Thu Oct 22 03:36:23 EDT 2020


Ulrich,

>>>> Jan Friesse <jfriesse at redhat.com> wrote on 20.10.2020 at 18:05 in message
> <9e9edd13-847c-a81f-9b28-0ecf8f17fd48 at redhat.com>:
>> I forgot to mention one very important change (in this text; the release
>> notes on the GitHub release are already fixed):
>>
> ...
>>
>> - Default token timeout was changed from 1 second to 3 seconds. Default
> 
> Hi!
> 
> The same stupid question as always: How is that value determined, assuming that in a LAN the per-hop delay should be less than 1 ms these days and the number of nodes is typically much less than 10? Is there a safety factor of 1000%, or what?
> Or is this just black magic, and the value was determined in a sleepless full-moon night by throwing dice?

It's somewhere in the middle actually.

The reason for increasing the value is the number of GSS cases where
increasing the token timeout helped reduce the number of "unexpected"
fencing events.

The proposal was to increase the value to 5 seconds, but that would make
upgrading hard, because nodes running the old version would detect token
loss (the default config is to resend the token 4 times, so 5 s / 4 =
1.25 s, which is longer than the old 1 s timeout).

There is no such problem with 3 seconds (3 s / 4 = 0.75 s).
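
To make the arithmetic above explicit, here is a minimal sketch in Python.
It uses only the simplified rule from this mail (token timeout divided by 4
resends) and is not how corosync computes its internal timers; the function
names are made up for illustration.

```python
# Simplified upgrade check, following the rule of thumb from this mail:
# a node resends the token roughly every (token timeout / 4).
# Assumption: nodes still on the old default time out after 1000 ms.

OLD_DEFAULT_TOKEN_MS = 1000
DEFAULT_RESENDS = 4


def resend_interval_ms(token_ms, resends=DEFAULT_RESENDS):
    """Approximate gap between token resends for a given token timeout."""
    return token_ms / resends


def old_nodes_tolerate(new_token_ms, old_token_ms=OLD_DEFAULT_TOKEN_MS):
    """True if the resend gap of an upgraded node stays below the timeout
    of nodes that still use the old default."""
    return resend_interval_ms(new_token_ms) < old_token_ms


if __name__ == "__main__":
    for candidate_ms in (3000, 5000):
        gap = resend_interval_ms(candidate_ms)
        verdict = "ok" if old_nodes_tolerate(candidate_ms) else "token loss on old nodes"
        print(f"token {candidate_ms} ms: resend gap {gap:.0f} ms, {verdict}")
```

With these assumptions 3000 ms passes the check and 5000 ms does not, which
matches the reasoning above.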

The main problem is that choosing timeouts is not an exact science. We
have to choose a timeout which is high enough to give nodes enough time
in case of spikes (of various kinds: CPU, blocked IO, network, ...) but
also low enough to react as quickly as possible. 1 second was working
well most of the time, but then something bad happened and a node was
fenced "without a reason". So to conclude, yes, it is kind of black magic.

Regards,
   Honza

> 
> Regards,
> Ulrich
> 
>> token timeout of 1000 ms was often changed by users because of other
>> workloads on the machine which may make corosync respond a bit later
>> than needed, resulting in token loss. 3000 ms was chosen as a compromise
>> between increasing the token timeout and allowing a live cluster upgrade
>> (other nodes should receive the token from a node with the new default
>> in time). It doesn't affect token_coefficient, so the final token
>> timeout still depends on the number of configured nodes (just the base
>> is higher). This change slows down failover a bit, so for clusters where
>> failover times are important, please change the token timeout in the
>> configuration file corosync.conf as follows:
>>
>> totem {
>>     version: 2
>>     token: 1000
>>     ...
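
For reference, a rough Python sketch of how the base value and
token_coefficient combine into the final token timeout. The formula
(token + (nodes - 2) * token_coefficient, applied only when the nodelist
has at least 3 nodes) and the 650 ms default for token_coefficient are
taken from my reading of corosync.conf(5), so please treat both as
assumptions and check the man page of your version. The function name is
made up for illustration.

```python
# Sketch of the final (runtime) token timeout.
# Assumptions, to be verified against corosync.conf(5):
#   - token_coefficient defaults to 650 ms
#   - it only applies when the nodelist contains at least 3 nodes


def effective_token_ms(token_ms=3000, nodes=2, token_coefficient_ms=650):
    """Base token timeout plus the per-node coefficient for nodes beyond 2."""
    if nodes < 3:
        return token_ms
    return token_ms + (nodes - 2) * token_coefficient_ms


if __name__ == "__main__":
    for base_ms in (1000, 3000):
        for node_count in (2, 3, 5, 8):
            total = effective_token_ms(base_ms, node_count)
            print(f"token={base_ms} ms, nodes={node_count}: {total} ms")
```

So setting token: 1000 restores the old base, but the result still grows
with the number of configured nodes.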
> 
> 