[ClusterLabs] Establishing Timeouts

Mon Oct 10 18:58:41 CEST 2016

Thanks for the clarification. So what's the easiest way to ensure that the cluster waits a desired timeout before deciding that a re-convergence is necessary? 

--
Eric Robinson

-----Original Message-----
From: Christine Caulfield [mailto:ccaulfie at redhat.com] 
Sent: Monday, October 10, 2016 4:34 AM
To: users at clusterlabs.org
Subject: Re: [ClusterLabs] Establishing Timeouts

On 10/10/16 05:51, Eric Robinson wrote:
> I have about a dozen corosync+pacemaker clusters and I am just now getting around to understanding timeouts.
> 
> Most of my corosync.conf files look something like this:
> 
>         version:        2
>         token:          5000
>         token_retransmits_before_loss_const: 10
>         join:           1000
>         consensus:      7500
>         vsftype:        none
>         max_messages:   20
>         secauth:        off
>         threads:        0
>         clear_node_high_bit: yes
>         rrp_mode: active
> 
> If I understand this correctly, this means the node will wait 50 seconds (5000ms x 10) before deciding that a cluster reconfig is necessary (perhaps after a link failure). Is that correct?
> 

No, that's not correct. The token timeout is 5 seconds in your example, because token is 5000 ms. The token timeout is always the value of totem.token.

token_retransmits_before_loss_const affects the token hold timeout - which is how long the token is held on a node that has no messages to send before being forwarded on. So increasing token_retransmits_before_loss_const changes the number of times per 'token' timeout that the token is actually sent.

In the example above, the token is sent approximately every 5000/10 = 500 ms. That figure is approximate: the value is scaled slightly to make actual timeouts less likely, and it is also affected by messages that may need to be sent.
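
To make the arithmetic above concrete, here is a rough sketch in Python. It is only an illustration, not corosync's own code: the function names are made up for this example, and the slight internal scaling mentioned above is deliberately left out, so the result is just the nominal figure.

```python
def token_loss_timeout_ms(token_ms: int) -> int:
    """Time without a token before it is declared lost.

    Per the explanation above, this is simply totem.token; it is NOT
    multiplied by token_retransmits_before_loss_const.
    """
    return token_ms


def nominal_token_send_interval_ms(token_ms: int, retransmits_before_loss: int) -> float:
    """Nominal interval between token sends within one token timeout.

    Hypothetical helper: ignores the slight scaling corosync applies
    and any effect of pending messages.
    """
    if retransmits_before_loss <= 0:
        raise ValueError("retransmits_before_loss must be positive")
    return token_ms / retransmits_before_loss


# Values taken from the corosync.conf quoted in this thread.
token = 5000
retransmits = 10

print("token loss timeout (ms):", token_loss_timeout_ms(token))
print("nominal send interval (ms):", nominal_token_send_interval_ms(token, retransmits))
```

With these inputs the loss timeout stays at 5000 ms, while the nominal send interval works out to 500 ms; raising token_retransmits_before_loss_const only shortens the second number.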

Chrissie

> I'm trying to understand how this works together with my bonded NIC's arp_interval settings. I normally set arp_interval=1000. My question is, how many arp losses are required before the bonding driver decides to failover to the other link? If arp_interval=1000, how many times does the driver send an arp and fail to receive a reply before it decides that the link is dead?
> 
> I think I need to know this so I can set my corosync.conf settings correctly to avoid "false positive" cluster failovers. In other words, if there is a link or switch failure, I want to make sure that the cluster allows plenty of time for link communication to recover before deciding that a node has actually died. 
> 
> --
> Eric Robinson
> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org 
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 
