[ClusterLabs] Sub-second failover detection in Corosync/Pacemaker clusters - 2026 update?

Windl, Ulrich u.windl at ukr.de
Wed Mar 4 07:45:19 UTC 2026


Hi!

Please forgive me for asking an AI; anyway, here is what it suggested:
Quick clarification (Alice)
I need the environment details to give precise guidance:

  *   Linux distro and kernel version
  *   Corosync and Pacemaker versions
  *   Network topology (dedicated heartbeat VLAN vs shared LAN) and link types (1GbE/10GbE)
  *   Cluster size (nodes count) and the resources being failed over
  *   Do you control MTU and the switches, and can you run multicast/UDPU, or do you prefer TCP?
  *   How often do failures occur (always or only sometimes), and have there been any recent changes to the environment?
Below I assume a generic production setup (modern Corosync 3.x and Pacemaker 2.x on Linux).
Short answers

  1.  Yes — sub‑second detection and initiation can be achieved in production with modern Corosync/Pacemaker, but only with careful tuning, dedicated low‑latency networking, and acceptance of trade‑offs.
  2.  Community reports and measurements vary: stable failovers of ~200–700 ms have been reported for simple resources; 1s is a commonly cited reliable minimum in real deployments. Expect full resource recovery often to be longer (seconds) depending on service stop/start times.
  3.  Yes — post‑2015 releases brought a new messaging layer (the kronosnet/knet transport in Corosync 3.x), improved UDPU, better quorum/stonith handling, and more tunable timeouts that reduce detection latency.
  4.  Key params and trade‑offs are listed below.
Quick Fix (try first, <2 minutes)

  *   Ensure the cluster network has low jitter and packet loss (ping <1 ms on a dedicated link). Confirm with: ping -i 0.1 -c 100 <peer> to sample jitter.
  *   Verify corosync/pacemaker versions: corosync -v ; pcs --version (or pacemakerd --version). Report results.
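To make the "low jitter" check objective, the RTT samples from ping can be reduced to a mean and a mean absolute deviation. A minimal sketch (the sample values below are made up for illustration, not real measurements):

```python
# Estimate link jitter from a list of ping RTT samples (milliseconds).

def jitter_stats(rtts_ms):
    """Return (mean RTT, mean absolute deviation) for a list of RTT samples."""
    mean = sum(rtts_ms) / len(rtts_ms)
    mad = sum(abs(r - mean) for r in rtts_ms) / len(rtts_ms)
    return mean, mad

# Illustrative samples, e.g. parsed from `ping -c 100` output.
samples = [0.21, 0.19, 0.23, 0.20, 0.45, 0.22]
mean, jitter = jitter_stats(samples)
print(f"mean={mean:.3f} ms  jitter={jitter:.3f} ms")
```

If the mean deviation approaches your token timeout budget, aggressive timers will misfire.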
Moderate steps (~5 minutes)

  1.  Use dedicated heartbeat network (separate VLAN/interface) and set proper MTU.
  2.  Tune Corosync:
     *   transport: knet is the default and recommended transport for Corosync 3.x; udpu remains available for unicast UDP.
     *   token and consensus timeouts (an example conservative starting point for sub‑second detection):
        *   token: 500 (ms)
        *   consensus: 600 (ms; must exceed token, default is 1.2 × token)
        *   join: reduce the join and hold timers modestly
     *   For Corosync 3.x, tune the knet link heartbeat parameters (e.g. knet_ping_interval, knet_ping_timeout; exact names and placement depend on version, see corosync.conf(5)).
  3.  Pacemaker timeouts:
     *   cluster property stonith-enabled=true (ensure fencing is fast)
     *   set stonith-timeout and migration-threshold to low but safe values
     *   resource agent timeouts: set op monitor intervals as low as the agent reliably supports (Pacemaker accepts values such as 500ms, but few agents complete probes that fast)
  4.  Use lightweight health checks (fast monitors) rather than heavy scripts.
  5.  Test failovers under load and measure with timestamps in logs (corosync/pacemaker logs include timing).
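Step 5 can be scripted: corosync/pacemaker log lines carry timestamps, so the measured failover time is simply the difference between the first link-down/membership-change entry and the resource-start entry. A minimal sketch (the log lines below are invented examples; real formats vary by distro and logging configuration):

```python
from datetime import datetime

# Illustrative log excerpts; real corosync/pacemaker formats vary.
failure_line  = "Mar 04 07:45:01.120 node1 corosync[812]:  [KNET  ] link: host: 2 link: 0 is down"
recovery_line = "Mar 04 07:45:01.680 node1 pacemaker-controld[955]: notice: Result of start operation for vip on node1: ok"

def stamp(line):
    """Parse the leading 'Mon DD HH:MM:SS.mmm' timestamp of a log line."""
    return datetime.strptime(" ".join(line.split()[:3]), "%b %d %H:%M:%S.%f")

delta = stamp(recovery_line) - stamp(failure_line)
print(f"failover took {delta.total_seconds() * 1000:.0f} ms")
```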
Nuclear option

  *   If tuning cannot reach target, consider redesign: split services into smaller failover units, use active‑active instead of active‑passive, or replace with specialized low‑latency HA solutions. Back up configs and data before any disruptive changes.
Concrete example (decisive, reasonable defaults)

  *   corosync.conf (illustrative):
     *   token: 500
     *   token_timeout: 1000
     *   interface mcast/udpu with bindnetaddr on dedicated NIC
  *   Pacemaker resource monitor ops:
     *   op monitor interval="300ms" timeout="1500ms"
  *   Ensure fencing/stonith completes in <500ms (fast power fencing or a virtualization fencing API such as fence_xvm).
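A fuller sketch of the totem stanza implied above (illustrative only: parameter availability and the placement of knet link options vary by Corosync version, so check corosync.conf(5) before copying anything):

```
totem {
    version: 2
    transport: knet
    token: 500         # ms; token loss declared after this silence
    consensus: 600     # ms; must exceed token (default is 1.2 * token)
    interface {
        linknumber: 0
        # knet link heartbeat; names/placement per corosync.conf(5)
        knet_ping_interval: 100   # ms
        knet_ping_timeout: 300    # ms
    }
}
```

Pair this with pcs-side settings such as `pcs property set stonith-enabled=true` and a low monitor interval on the critical resource, then validate under load.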
Trade-offs and risks

  *   False positives: aggressive timeouts increase split‑brain/failover storm risk. Require rock‑solid network and reliable fencing.
  *   Network sensitivity: jitter/packet loss will cause spurious failovers. Dedicated link and QoS recommended.
  *   CPU/network overhead: frequent heartbeats and fast monitors increase CPU/network usage (generally small, but measurable on constrained nodes).
  *   Recovery time vs detection: detection can be sub‑second, but service stop/start and client reconnection often dominate total outage.
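The last point can be made concrete with a toy outage budget: even with ~300 ms detection, service start and client reconnection usually dominate the total. All numbers below are assumptions for illustration:

```python
# Toy end-to-end outage budget (all values are illustrative assumptions).
budget_ms = {
    "detection (corosync token loss)": 300,
    "fencing confirmation": 500,
    "service start on surviving node": 2000,
    "client reconnect / TCP retries": 1500,
}
total = sum(budget_ms.values())
for phase, ms in budget_ms.items():
    print(f"{phase:35s} {ms:5d} ms ({100 * ms / total:4.1f}%)")
print(f"{'total':35s} {total:5d} ms")
```

With these assumed numbers, detection is well under 10% of the outage: shaving it from 1 s to 500 ms buys far less than speeding up the service restart.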
Reliable minimum recommendation

  *   For production, target ~500 ms detection/initiation as an aggressive but achievable baseline; 1 s is safer and widely reliable. Validate with end‑to‑end tests for your services.
If you want, provide the environment details I asked for and I’ll produce a tuned corosync.conf, pacemaker properties, concrete resource monitor examples, and a test plan with log commands and expected timestamps.

As always: Take AI answers with a grain of salt 😉

Kind regards,
Ulrich Windl

From: Users <users-bounces at clusterlabs.org> On Behalf Of Holger Haidinger <DE ERL SWD EM> via Users
Sent: Friday, February 20, 2026 4:41 PM
To: users at clusterlabs.org
Cc: Holger Haidinger <DE ERL SWD EM> <Holger.Haidinger at fluenceenergy.com>
Subject: [EXT] [EXT] [ClusterLabs] Sub-second failover detection in Corosync/Pacemaker clusters - 2026 update?

Hi everyone,

I'm revisiting a thread from 2015 (https://www.mail-archive.com/users@clusterlabs.org/msg00554.html) about achieving sub-second failover detection in HA clusters, and I'm curious about the current state of affairs nearly a decade later.

My Environment:

- Corosync 3.1.6
- Pacemaker 2.1.2
- Architecture: 2-node cluster + QDevice (also testing 3-node setups)
- Network: Dedicated physical NIC for cluster traffic (low-latency requirements)

Specific Questions:

1. With modern Corosync/Pacemaker versions, is sub-second fault detection and failover initiation realistically achievable in production environments?
2. Are there any published measurements or community experiences showing the fastest stable failover times you've achieved? What's considered a reliable minimum time span?
3. Have there been significant enhancements in the newer versions of Corosync and Pacemaker (post-2015) that specifically target detection speed and failover latency?
4. If sub-second detection is possible, what are the key configuration parameters and potential trade-offs (false positives, network sensitivity, resource overhead)?

Thanks in advance!

Holger Haidinger
