[ClusterLabs Developers] How to calculate total host failure detection time in Corosync?

Vicki Chen vchen at ca.ibm.com
Thu Jun 12 15:30:15 UTC 2025


I’d greatly appreciate it if anyone could provide an answer to this. Thank you!

best regards,

Vicki
________________________________
From: Developers <developers-bounces at clusterlabs.org> on behalf of Vicki Chen <vchen at ca.ibm.com>
Sent: May 30, 2025 11:37 AM
To: developers at clusterlabs.org <developers at clusterlabs.org>
Subject: [EXTERNAL] [ClusterLabs Developers] How to calculate total host failure detection time in Corosync?

Hi,

I’ve been researching the Corosync communication layer and would like to understand how to calculate the total failure timeout for a host. From what I’ve gathered, the relevant parameters include the base token (defined in corosync.conf), the runtime token timeout (runtime.config.totem.token), as well as token_retransmit, token_retransmit_before_loss_const, and consensus. Could you please clarify how these values contribute to the overall failure detection time?

My current understanding is:

runtime.config.totem.token = base token + (number of nodes - 2) * token_coefficient
total failure detection time = runtime.config.totem.token + (token_retransmit * token_retransmit_before_loss_const)
consensus = 1.2 * runtime.config.totem.token

For example, with 3 servers:
base token (from corosync.conf) = 2000 ms
token_coefficient = 650 ms
runtime.config.totem.token = 2650 ms
token_retransmit = 1000 ms
token_retransmit_before_loss_const = 4
consensus = 3180 ms

Are those values correct?
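
To show how I arrived at the 2650 ms and 3180 ms figures, here is a minimal Python sketch of the arithmetic. The function names are my own and not part of Corosync, and the formulas are only my reading of the parameters, so please treat them as assumptions to be confirmed rather than as Corosync's actual behaviour.

```python
def runtime_token_ms(base_token_ms, nodes, token_coefficient_ms=650):
    """Runtime token timeout as I understand it (hypothetical helper)."""
    # Assumption: the coefficient only adds time for nodes beyond the second,
    # so clusters of 2 or fewer nodes just use the base token.
    extra_nodes = max(nodes - 2, 0)
    return base_token_ms + extra_nodes * token_coefficient_ms


def consensus_ms(token_ms):
    """Consensus timeout assumed to be 1.2 x the runtime token."""
    # Integer arithmetic avoids floating-point rounding on the 1.2 factor.
    return token_ms * 12 // 10


token = runtime_token_ms(2000, 3)
print(token, consensus_ms(token))  # 2650 3180
```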

For example, suppose Server 2 goes down and the effective token timeout (runtime.config.totem.token) is 2650 ms. The token is then retransmitted 4 times at 1000 ms intervals, for a total of 4000 ms. Adding the two gives a total failure timeout of 6650 ms before the node is declared failed. Is that correct?
How does the consensus timeout then come into play? Once the node is declared down after 6650 ms, does the cluster have to remove it within the 3180 ms consensus timeout? Is there no grace period in Corosync? Is my analysis correct? Thank you!
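
To make the question concrete, the sketch below computes the total under my hypothesis (token timeout plus all retransmit intervals), and separately what the total would be if the consensus timeout were simply added on top. Both formulas and the function name are my own assumptions, which is exactly what I would like confirmed or corrected.

```python
def hypothesized_detection_ms(token_ms, retransmit_ms, retransmits_before_loss):
    """Total detection time under my assumption (hypothetical helper)."""
    # Assumption: the retransmit intervals are added to the token timeout.
    return token_ms + retransmit_ms * retransmits_before_loss


total = hypothesized_detection_ms(2650, 1000, 4)
print(total)         # 6650

# Assumption: the consensus timeout (3180 ms) is added after detection.
print(total + 3180)  # 9830
```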

best regards,

Vicki Chen

