[ClusterLabs] Antw: Re: Establishing Timeouts

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue Oct 11 06:31:00 UTC 2016


>>> Klaus Wenninger <kwenning at redhat.com> wrote on 10.10.2016 at 20:04 in
message <936e4d4b-df5c-246d-4552-5678653b3bd6 at redhat.com>:
> On 10/10/2016 06:58 PM, Eric Robinson wrote:
>> Thanks for the clarification. So what's the easiest way to ensure that the
>> cluster waits a desired timeout before deciding that a re-convergence is
>> necessary?
> 
> By raising the token (lost) timeout I would say.

Somewhat off-topic:
I have always wished there were a kind of spreadsheet where you could play with those parameters and, given the required constraints, be told what consequences changing one parameter has. The interdependencies seem quite complex: some restrictions seem hard, others soft, and some defaults seem to result from "black magic". Default values are also not always documented.
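
A minimal sketch of what I mean, in Python. It encodes only two relationships: the one Chrissie describes further down (the token is passed on roughly every token / token_retransmits_before_loss_const ms), and the rule that consensus must be at least 1.2 * token, which is how I remember corosync.conf(5) and should be treated as an assumption. The class and method names are made up for illustration, and the real scaling inside corosync differs slightly.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed minimum ratio of consensus to token (from memory of corosync.conf(5)).
CONSENSUS_FACTOR = 1.2


@dataclass
class Totem:
    """Hypothetical holder for a few totem settings, all in milliseconds."""
    token: int = 1000
    retransmits: int = 4              # token_retransmits_before_loss_const
    consensus: Optional[int] = None   # None means "derive from token"

    def effective_consensus(self) -> float:
        # Fall back to the assumed default of 1.2 * token when unset.
        if self.consensus is not None:
            return float(self.consensus)
        return CONSENSUS_FACTOR * self.token

    def approx_token_interval(self) -> float:
        # Rough interval at which the token is sent, per Chrissie's explanation.
        return self.token / self.retransmits

    def problems(self) -> List[str]:
        # Collect violated constraints instead of failing on the first one.
        issues = []
        if self.token <= 0:
            issues.append("token must be positive")
        if self.effective_consensus() < CONSENSUS_FACTOR * self.token:
            issues.append("consensus is below 1.2 * token")
        return issues


eric = Totem(token=5000, retransmits=10, consensus=7500)
print(eric.approx_token_interval())  # 500.0
print(eric.problems())               # []
```

With Eric's values nothing is flagged; lowering consensus to 5000 while keeping token at 5000 would report the consensus constraint.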

Why is it that "a new configuration takes about 50 milliseconds"? Where does that figure come from?

"It is not recommended to alter this value without guidance from the corosync community." (token_retransmit being 238 ms)

(Just two examples)
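
For the 238 ms figure, one guess at the arithmetic, assuming the defaults are token = 1000 ms and token_retransmits_before_loss_const = 4: dividing token by (retransmits + 0.2) and truncating gives exactly 238. The 0.2 is my assumption, chosen because it reproduces the documented number; it still would not explain why that scaling was picked. The function name is hypothetical.

```python
def guessed_token_retransmit(token_ms: int = 1000, retransmits: int = 4) -> int:
    """Guess at the derived token_retransmit value in ms.

    Assumption: the divisor is retransmits + 0.2 and the result is truncated.
    """
    return int(token_ms / (retransmits + 0.2))


print(guessed_token_retransmit())          # 238
print(guessed_token_retransmit(5000, 10))  # 490
```

The second call uses Eric's settings and lands near the "approximately 500 ms, scaled slightly" that Chrissie mentions below.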

Some defaults could use an explanation, e.g. why exactly 4 retransmits and not 2 or 3? Is the protocol expected to have a high loss rate?

Regards,
Ulrich

> 
> Please correct me (Chrissie), but I see the
> token (lost) timeout as resilience against
> static delays plus jitter on top, and
> token_retransmits_before_loss_const as resilience
> against packet loss.
> 
>>
>> --
>> Eric Robinson
>>    
>>
>> -----Original Message-----
>> From: Christine Caulfield [mailto:ccaulfie at redhat.com] 
>> Sent: Monday, October 10, 2016 4:34 AM
>> To: users at clusterlabs.org 
>> Subject: Re: [ClusterLabs] Establishing Timeouts
>>
>> On 10/10/16 05:51, Eric Robinson wrote:
>>> I have about a dozen corosync+pacemaker clusters and I am just now getting
>>> around to understanding timeouts.
>>>
>>> Most of my corosync.conf files look something like this:
>>>
>>>         version:        2
>>>         token:          5000
>>>         token_retransmits_before_loss_const: 10
>>>         join:           1000
>>>         consensus:      7500
>>>         vsftype:        none
>>>         max_messages:   20
>>>         secauth:        off
>>>         threads:        0
>>>         clear_node_high_bit: yes
>>>         rrp_mode: active
>>>
>>> If I understand this correctly, this means the node will wait 50 seconds
>>> (5000ms x 10) before deciding that a cluster reconfig is necessary (perhaps
>>> after a link failure). Is that correct?
>>>
>> No, that's not correct. The token timeout is 5 seconds in your example,
>> because token is 5000 ms. The token timeout is always whatever the value of
>> totem.token is.
>>
>> token_retransmits_before_loss_const affects the token hold timeout - which is
>> how long the token is held on a node that has no messages to send before
>> being forwarded on. So increasing token_retransmits_before_loss_const changes
>> the number of times per 'token' timeout that the token is actually sent.
>>
>> In the example above you will see that the token is sent approximately every
>> 5000/10 = 500 ms. That's approximate: the value is scaled slightly to make
>> actual timeouts less likely, and it is also affected by messages that may need
>> to be sent.
>>
>> Chrissie
>>
>>> I'm trying to understand how this works together with my bonded NIC's
>>> arp_interval settings. I normally set arp_interval=1000. My question is, how
>>> many arp losses are required before the bonding driver decides to failover to
>>> the other link? If arp_interval=1000, how many times does the driver send an
>>> arp and fail to receive a reply before it decides that the link is dead?
>>>
>>> I think I need to know this so I can set my corosync.conf settings correctly
>>> to avoid "false positive" cluster failovers. In other words, if there is a
>>> link or switch failure, I want to make sure that the cluster allows plenty of
>>> time for link communication to recover before deciding that a node has
>>> actually died.
>>>
>>> --
>>> Eric Robinson
>>>
>>>
>>> _______________________________________________
>>> Users mailing list: Users at clusterlabs.org 
>>> http://clusterlabs.org/mailman/listinfo/users 
>>>
>>> Project Home: http://www.clusterlabs.org Getting started: 
>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>>> Bugs: http://bugs.clusterlabs.org 
>>>
>>






