[ClusterLabs] Sub-second failover detection in Corosync/Pacemaker clusters - 2026 update?

Wed Mar 4 07:40:57 UTC 2026

Hi!

I cannot actually answer your questions, but every timeout is a compromise:
If it’s too long, you have an unnecessarily long service outage; if it’s too short, you trigger recovery mechanisms without actual need. With a sub-second reaction time, you may run into additional trouble.
Just one example: in a SAN RAID storage system, a single disk with bad sectors caused significant read delays while the disk retried reading and the controller did not mark the disk as bad. (In this case the timeout was too long, which spared the vendor from having to replace the disk.) In such a case, switching to another node accessing the same disks would not help…
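
To make the compromise concrete, here is a toy sketch with made-up numbers. It is not a model of Corosync's token protocol; the delay values and the function name count_false_alarms are purely hypothetical. It only counts how often a healthy but occasionally slow node would be declared dead for a given timeout, while the timeout itself is roughly the worst-case time to notice a real failure.

```python
def count_false_alarms(timeout_ms, response_delays_ms):
    """Count responses from a healthy node that arrive later than the timeout.

    Each such response would be misread as a node failure and would
    trigger recovery without actual need.
    """
    return sum(1 for delay in response_delays_ms if delay > timeout_ms)


# Hypothetical response delays (ms) of a node that is alive the whole time
# but sometimes stalls, e.g. because of slow storage or a scheduling hiccup.
delays_ms = [2, 3, 2, 40, 3, 250, 2, 900, 3, 2]

for timeout_ms in (100, 500, 1000, 3000):
    alarms = count_false_alarms(timeout_ms, delays_ms)
    # A real failure is noticed after roughly timeout_ms at the earliest.
    print(f"timeout {timeout_ms:>4} ms: {alarms} false alarm(s), "
          f"real failure detected after about {timeout_ms} ms")
```

With these invented delays, the sub-second timeouts produce false alarms and the longer ones do not, at the price of a longer outage when a node really fails.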

Kind regards,
Ulrich Windl

From: Users <users-bounces at clusterlabs.org> On Behalf Of Holger Haidinger <DE ERL SWD EM> via Users
Sent: Friday, February 20, 2026 4:41 PM
To: users at clusterlabs.org
Cc: Holger Haidinger <DE ERL SWD EM> <Holger.Haidinger at fluenceenergy.com>
Subject: [EXT] [EXT] [ClusterLabs] Sub-second failover detection in Corosync/Pacemaker clusters - 2026 update?

Hi everyone,

I'm revisiting a thread from 2015 (https://www.mail-archive.com/users@clusterlabs.org/msg00554.html) about achieving sub-second failover detection in HA clusters, and I'm curious about the current state of affairs more than a decade later.

My Environment:

- Corosync 3.1.6
- Pacemaker 2.1.2
- Architecture: 2-node cluster + QDevice (also testing 3-node setups)
- Network: Dedicated physical NIC for cluster traffic (low-latency requirements)

Specific Questions:

1. With modern Corosync/Pacemaker versions, is sub-second fault detection and failover initiation realistically achievable in production environments?
2. Are there any published measurements or community experiences showing the fastest stable failover times you've achieved? What's considered a reliable minimum time span?
3. Have there been significant enhancements in the newer versions of Corosync and Pacemaker (post-2015) that specifically target detection speed and failover latency?
4. If sub-second detection is possible, what are the key configuration parameters and potential trade-offs (false positives, network sensitivity, resource overhead)?

Thanks in advance!

Holger Haidinger
