[ClusterLabs] Question about two level STONITH/fencing
Anton Gavriliuk
Anton.Gavriliuk at hpe.ua
Thu Feb 5 13:20:53 UTC 2026
I tried,
[root at memverge ~]# pcs stonith create watchdog-fencing fence_watchdog
But after that, the running cluster hangs... I can't run "crm_mon -Rr"; it fails with "error: Lost connection to controller".
Perhaps this is because /dev/watchdog is already managed by pacemaker?
[root at memverge ~]# systemctl status sbd
● sbd.service - Shared-storage based fencing daemon
Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; preset: disabled)
Drop-In: /etc/systemd/system/sbd.service.d
└─override.conf
Active: active (running) since Tue 2026-02-03 16:09:00 EET; 1 day 22h ago
Invocation: 11a9ba526ef5403682980d67a886a7b9
Docs: man:sbd(8)
Main PID: 2473 (sbd)
Tasks: 3 (limit: 3355442)
Memory: 18.8M (peak: 19.5M)
CPU: 2min 22.568s
CGroup: /system.slice/sbd.service
├─2473 "sbd: inquisitor"
├─2487 "sbd: watcher: Pacemaker"
└─2488 "sbd: watcher: Cluster"
Feb 03 16:09:00 memverge sbd[2473]: notice: inquisitor_child: Servant cluster is healthy (age: 0)
Feb 03 16:09:00 memverge sbd[2473]: notice: watchdog_init: Using watchdog device '/dev/watchdog'
Feb 03 16:09:00 memverge systemd[1]: Started sbd.service - Shared-storage based fencing daemon.
Feb 03 16:09:04 memverge sbd[2473]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
Feb 03 16:11:27 memverge systemd[1]: /etc/systemd/system/sbd.service.d/override.conf:1: Assignment outside of section. Ignoring.
Feb 03 16:11:28 memverge systemd[1]: /etc/systemd/system/sbd.service.d/override.conf:1: Assignment outside of section. Ignoring.
Feb 03 16:25:02 memverge sbd[2473]: warning: inquisitor_child: pcmk health check: UNHEALTHY
Feb 03 16:25:02 memverge sbd[2473]: warning: inquisitor_child: Servant pcmk is outdated (age: 1246)
Feb 03 16:25:03 memverge sbd[2473]: notice: inquisitor_child: Servant pcmk is healthy (age: 0)
Feb 05 15:01:05 memverge systemd[1]: /etc/systemd/system/sbd.service.d/override.conf:1: Assignment outside of section. Ignoring.
[root at memverge ~]#
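(As an aside, the repeated "Assignment outside of section. Ignoring." messages mean the first line of the override.conf drop-in is a key=value assignment with no section header above it, so systemd ignores it. A syntactically valid drop-in needs the section named first, roughly like the sketch below; the actual setting being overridden is not visible in this output, so the option shown is only a placeholder.)

# /etc/systemd/system/sbd.service.d/override.conf (sketch, placeholder option)
[Service]
TimeoutStartSec=120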
Oh... now it opened:
Cluster Summary:
* Stack: corosync (Pacemaker is running)
* Current DC: memverge (27) (version 3.0.1-3.el10-b1a23a6) - MIXED-VERSION partition with quorum
* Last updated: Thu Feb 5 15:14:45 2026
* Last change: Thu Feb 5 15:12:09 2026 by root via root on memverge
* 2 nodes configured
* 23 resource instances configured
Node List:
* Node memverge (27): online, feature set 3.20.1
* Node memverge2 (28): online, feature set <3.15.1
Full List of Resources:
* Resource Group: g-nfs:
* pb_nfs (ocf:heartbeat:portblock): Started memverge
* ip0_nfs (ocf:heartbeat:IPaddr2): Started memverge
* fs_nfs_internal_info_HA (ocf:heartbeat:Filesystem): Started memverge
* fs_nfsshare_exports_HA (ocf:heartbeat:Filesystem): Started memverge
* nfsserver (ocf:heartbeat:nfsserver): Started memverge
* expfs_nfsshare_exports_HA (ocf:heartbeat:exportfs): Started memverge
* samba_service (systemd:smb): Started memverge
* fs_sambashare_exports_HA (ocf:heartbeat:Filesystem): Started memverge
* punb_nfs (ocf:heartbeat:portblock): Started memverge
* Resource Group: g-iscsi:
* pb_iscsi (ocf:heartbeat:portblock): Started memverge
* ip0_iscsi (ocf:heartbeat:IPaddr2): Started memverge
* ip1_iscsi (ocf:heartbeat:IPaddr2): Started memverge
* iscsi_target (ocf:heartbeat:iSCSITarget): Started memverge
* iscsi_lun_drbd3 (ocf:heartbeat:iSCSILogicalUnit): Started memverge
* iscsi_lun_drbd4 (ocf:heartbeat:iSCSILogicalUnit): Started memverge
* punb_iscsi (ocf:heartbeat:portblock): Started memverge
* Clone Set: ha-nfs-clone [ha-nfs] (promotable):
* ha-nfs (ocf:linbit:drbd): Unpromoted memverge2
* ha-nfs (ocf:linbit:drbd): Promoted memverge
* Clone Set: ha-iscsi-clone [ha-iscsi] (promotable):
* ha-iscsi (ocf:linbit:drbd): Unpromoted memverge2
* ha-iscsi (ocf:linbit:drbd): Promoted memverge
* ipmi-fence-memverge (stonith:fence_ipmilan): Started memverge2
* ipmi-fence-memverge2 (stonith:fence_ipmilan): Started memverge
* watchdog-fencing (stonith:fence_watchdog): Starting memverge2
Failed Resource Actions:
* ipmi-fence-memverge_monitor_30000 on memverge2 'Error occurred' (1): call=93, status='Error', exitreason='Lost connection to fencer'
* ipmi-fence-memveF
And there are so many records like this in /var/log/messages:
Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer connection failed (will retry): Transport endpoint is not connected
[root at memverge ~]#
I'm new to pacemaker/corosync, so it is all quite complicated to me 😊
Or maybe add fence_ipmilan as level 1 and not add sbd as level 2, assuming the cluster should automatically detect it (because have-watchdog=true) and fall back to sbd even without an explicit level 2?
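For reference, here is a minimal sketch of the explicit two-level variant (device names taken from the cluster above; the stonith-watchdog-timeout value is only an assumption - it is typically sized to about twice SBD_WATCHDOG_TIMEOUT from /etc/sysconfig/sbd):

# sketch: IPMI as level 1, watchdog self-fencing as level 2, per node
pcs stonith level add 1 memverge ipmi-fence-memverge
pcs stonith level add 2 memverge watchdog-fencing
pcs stonith level add 1 memverge2 ipmi-fence-memverge2
pcs stonith level add 2 memverge2 watchdog-fencing
# assumed value; SBD_WATCHDOG_TIMEOUT defaults to 5s
pcs property set stonith-watchdog-timeout=10s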
Anton
From: Klaus Wenninger <kwenning at redhat.com>
Sent: Thursday, February 5, 2026 2:52 PM
To: Anton Gavriliuk <Anton.Gavriliuk at hpe.ua>
Cc: Andrei Borzenkov <arvidjaar at gmail.com>; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Subject: Re: [ClusterLabs] Question about two level STONITH/fencing
On Thu, Feb 5, 2026 at 12:56 PM Anton Gavriliuk <Anton.Gavriliuk at hpe.ua> wrote:
Correct, in addition to the two cluster nodes there is a dedicated 3rd node, a physical server, acting as qdevice.
I'm thinking about a two-level fencing topology: 1st level - fence_ipmilan, 2nd level - diskless sbd (hpwdt, /dev/watchdog).
But I can't add sbd as a 2nd-level fencing device:
[root at memverge2 ~]# pcs stonith level add 2 memverge watchdog
Error: Stonith resource(s) 'watchdog' do not exist, use --force to override
Error: Errors have occurred, therefore pcs is unable to continue
[root at memverge2 ~]#
So back to the original question - what is the most correct way of implementing STONITH/fencing with fence_ipmilan + diskless sbd (hpwdt, /dev/watchdog)?
Sorry, I had overlooked qdevice (actually I thought I checked for it but ...).
To add the watchdog into a topology you have to make it visible first - just add it like any fencing device, with fence_watchdog as the agent.
There is a fence_watchdog script, but that is just for the meta-data. Pacemaker will recognize it and handle the actual fencing internally.
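For example, something like this (just a sketch - the device name 'watchdog' only mirrors the name used in the pcs stonith level command earlier in this thread):

pcs stonith create watchdog fence_watchdog
pcs stonith level add 2 memverge watchdog
pcs stonith level add 2 memverge2 watchdog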
Regards,
Klaus
Anton
-----Original Message-----
From: Andrei Borzenkov <arvidjaar at gmail.com>
Sent: Thursday, February 5, 2026 1:17 PM
To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Cc: Anton Gavriliuk <Anton.Gavriliuk at hpe.ua>
Subject: Re: [ClusterLabs] Question about two level STONITH/fencing
On Thu, Feb 5, 2026 at 2:07 PM Klaus Wenninger <kwenning at redhat.com> wrote:
>
>
>
> On Wed, Feb 4, 2026 at 4:36 PM Anton Gavriliuk via Users <users at clusterlabs.org> wrote:
>>
>>
>>
>> Hello
>>
>>
>>
>> There is a two-node (HPE DL345 Gen12 servers), shared-nothing, DRBD-based synchronous-replication (Protocol C), distributed active/standby Pacemaker storage metro-cluster. It is configured with qdevice, heuristics (parallel fping), and fencing - fence_ipmilan and diskless sbd (hpwdt, /dev/watchdog). All cluster resources are configured to always run together on the same node.
>>
>>
>>
>> The two storage cluster nodes and the qdevice are running on Rocky Linux 10.1
>>
>> Pacemaker version 3.0.1
>>
>> Corosync version 3.1.9
>>
>> DRBD version 9.3.0
>>
>>
>>
>> So, the question is - what is the most correct way of implementing STONITH/fencing with fence_ipmilan + diskless sbd (hpwdt, /dev/watchdog)?
>
>
> The correct way of using diskless sbd with a two-node cluster is not
> to use it ;-)
>
> Diskless sbd (watchdog-fencing) requires 'real' quorum, and the quorum
> provided by corosync in two-node mode would introduce split-brain,
> which is the reason why sbd recognizes two-node operation and
> replaces quorum from corosync with the information that the peer node is currently in the cluster. This is fine for working with poison-pill fencing - a single shared disk then doesn't become a single point of failure as long as the peer is there. But for watchdog-fencing that doesn't help, because the peer going away would mean you have to commit suicide.
>
> An alternative with a two-node cluster is to step away from the actual two-node design and go with qdevice for 'real' quorum.
Hmm ... the original description does mention qdevice, although it is not quite clear where it is located (is there a third node?)
> You'll need some kind of 3rd node but it doesn't have to be a full cluster node.
>
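(For reference, a minimal sketch of adding qdevice for 'real' quorum, assuming corosync-qnetd is already set up and running on a third host - the host name below is hypothetical:)

# run on one cluster node; qnetd must already be running on qnetd-host
pcs quorum device add model net host=qnetd-host algorithm=ffsplit
# verify that the Qdevice now appears in the votequorum information
pcs quorum status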