[ClusterLabs] Question about two level STONITH/fencing
Klaus Wenninger
kwenning at redhat.com
Fri Feb 6 14:41:08 UTC 2026
On Thu, Feb 5, 2026 at 8:07 PM Anton Gavriliuk <Anton.Gavriliuk at hpe.ua>
wrote:
>
> - The other way round: pcs stonith create watchdog fence_watchdog
>
>
>
> Yes, that works, thank you! After creation it automatically started on the
> 2nd node – memverge2.
>
>
>
> Cluster Summary:
>
> * Stack: corosync (Pacemaker is running)
>
> * Current DC: memverge2 (28) (version 3.0.1-3.el10-b1a23a6) - partition
> with quorum
>
> * Last updated: Thu Feb 5 21:02:49 2026 on memverge
>
> * Last change: Thu Feb 5 21:01:00 2026 by root via root on memverge
>
> * 2 nodes configured
>
> * 23 resource instances configured
>
>
>
> Node List:
>
> * Node memverge (27): online, feature set 3.20.1
>
> * Node memverge2 (28): online, feature set 3.20.1
>
>
>
> Full List of Resources:
>
> * Resource Group: g-nfs:
>
> * pb_nfs (ocf:heartbeat:portblock): Started memverge
>
> * ip0_nfs (ocf:heartbeat:IPaddr2): Started memverge
>
> * fs_nfs_internal_info_HA (ocf:heartbeat:Filesystem): Started
> memverge
>
> * fs_nfsshare_exports_HA (ocf:heartbeat:Filesystem): Started
> memverge
>
> * nfsserver (ocf:heartbeat:nfsserver): Started memverge
>
> * expfs_nfsshare_exports_HA (ocf:heartbeat:exportfs): Started
> memverge
>
> * samba_service (systemd:smb): Started memverge
>
> * fs_sambashare_exports_HA (ocf:heartbeat:Filesystem): Started
> memverge
>
> * punb_nfs (ocf:heartbeat:portblock): Started memverge
>
> * Resource Group: g-iscsi:
>
> * pb_iscsi (ocf:heartbeat:portblock): Started memverge
>
> * ip0_iscsi (ocf:heartbeat:IPaddr2): Started memverge
>
> * ip1_iscsi (ocf:heartbeat:IPaddr2): Started memverge
>
> * iscsi_target (ocf:heartbeat:iSCSITarget): Started memverge
>
> * iscsi_lun_drbd3 (ocf:heartbeat:iSCSILogicalUnit): Started
> memverge
>
> * iscsi_lun_drbd4 (ocf:heartbeat:iSCSILogicalUnit): Started
> memverge
>
> * punb_iscsi (ocf:heartbeat:portblock): Started memverge
>
> * Clone Set: ha-nfs-clone [ha-nfs] (promotable):
>
> * ha-nfs (ocf:linbit:drbd): Promoted memverge
>
> * ha-nfs (ocf:linbit:drbd): Unpromoted memverge2
>
> * Clone Set: ha-iscsi-clone [ha-iscsi] (promotable):
>
> * ha-iscsi (ocf:linbit:drbd): Promoted memverge
>
> * ha-iscsi (ocf:linbit:drbd): Unpromoted memverge2
>
> * ipmi-fence-memverge (stonith:fence_ipmilan): Started memverge2
>
> * ipmi-fence-memverge2 (stonith:fence_ipmilan): Started
> memverge
>
> * watchdog (stonith:fence_watchdog): Started memverge2
>
>
>
> But I assume I should create the same for the 1st node – memverge?
>
You probably won't need a 2nd instance. It is the same as with any other
fencing resource, where usually only monitoring would be running on it, and
iirc monitoring doesn't do anything for the watchdog device anyway.
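
If you want to double-check, something like the following should be enough
(plain pcs commands; the exact output varies between pcs versions):

    pcs stonith config watchdog
    pcs stonith status

A single fence_watchdog resource named 'watchdog' is all that is needed;
which node it happens to run on shouldn't matter for watchdog fencing.
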
Klaus
>
>
> Anton
>
>
>
> *From:* Klaus Wenninger <kwenning at redhat.com>
> *Sent:* Thursday, February 5, 2026 4:16 PM
> *To:* Anton Gavriliuk <Anton.Gavriliuk at hpe.ua>
> *Cc:* Andrei Borzenkov <arvidjaar at gmail.com>; Cluster Labs - All topics
> related to open-source clustering welcomed <users at clusterlabs.org>
> *Subject:* Re: [ClusterLabs] Question about two level STONITH/fencing
>
>
>
>
>
>
>
> On Thu, Feb 5, 2026 at 3:07 PM Anton Gavriliuk <Anton.Gavriliuk at hpe.ua>
> wrote:
>
>
> - But sorry, again I forgot to mention that the fence resource has to be
> called 'watchdog', otherwise pacemaker won't align it with the already
> existing internal hidden device (which exists if you have
> stonith-watchdog-timeout != 0).
>
>
>
> [root at memverge ~]# pcs stonith create watchdog-fencing watchdog
>
> Error: Agent 'stonith:watchdog' is not installed or does not provide valid
> metadata: crm_resource: Metadata query for stonith:watchdog failed: No such
> device or address, Error performing operation: No such object, use --force
> to override
>
> Error: Errors have occurred, therefore pcs is unable to continue
>
>
>
> The other way round: pcs stonith create watchdog fence_watchdog
>
>
>
> [root at memverge ~]#
>
>
>
> - Can you provide your CIB & corosync config so that we don't have to
> write back and forth so often?
>
>
>
> I attached it in the files.
>
>
>
> Anton
>
>
>
> *From:* Klaus Wenninger <kwenning at redhat.com>
> *Sent:* Thursday, February 5, 2026 3:42 PM
> *To:* Anton Gavriliuk <Anton.Gavriliuk at hpe.ua>
> *Cc:* Andrei Borzenkov <arvidjaar at gmail.com>; Cluster Labs - All topics
> related to open-source clustering welcomed <users at clusterlabs.org>
> *Subject:* Re: [ClusterLabs] Question about two level STONITH/fencing
>
>
>
>
>
>
>
> On Thu, Feb 5, 2026 at 2:21 PM Anton Gavriliuk <Anton.Gavriliuk at hpe.ua>
> wrote:
>
> I tried,
>
>
>
> [root at memverge ~]# pcs stonith create watchdog-fencing fence_watchdog
>
>
>
> But after that, the running cluster is hanging... I can't run "crm_mon
> -Rr" (“error: Lost connection to controller”).
>
>
>
> Perhaps this is because /dev/watchdog is already being managed by pacemaker?
>
>
>
> [root at memverge ~]# systemctl status sbd
>
> ● sbd.service - Shared-storage based fencing daemon
>
> Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; preset:
> disabled)
>
> Drop-In: /etc/systemd/system/sbd.service.d
>
> └─override.conf
>
> Active: active (running) since Tue 2026-02-03 16:09:00 EET; 1 day 22h
> ago
>
> Invocation: 11a9ba526ef5403682980d67a886a7b9
>
> Docs: man:sbd(8)
>
> Main PID: 2473 (sbd)
>
> Tasks: 3 (limit: 3355442)
>
> Memory: 18.8M (peak: 19.5M)
>
> CPU: 2min 22.568s
>
> CGroup: /system.slice/sbd.service
>
> ├─2473 "sbd: inquisitor"
>
> ├─2487 "sbd: watcher: Pacemaker"
>
> └─2488 "sbd: watcher: Cluster"
>
>
>
> Feb 03 16:09:00 memverge sbd[2473]: notice: inquisitor_child: Servant
> cluster is healthy (age: 0)
>
> Feb 03 16:09:00 memverge sbd[2473]: notice: watchdog_init: Using
> watchdog device '/dev/watchdog'
>
> Feb 03 16:09:00 memverge systemd[1]: Started sbd.service - Shared-storage
> based fencing daemon.
>
> Feb 03 16:09:04 memverge sbd[2473]: notice: inquisitor_child: Servant
> pcmk is healthy (age: 0)
>
> Feb 03 16:11:27 memverge systemd[1]:
> /etc/systemd/system/sbd.service.d/override.conf:1: Assignment outside of
> section. Ignoring.
>
> Feb 03 16:11:28 memverge systemd[1]:
> /etc/systemd/system/sbd.service.d/override.conf:1: Assignment outside of
> section. Ignoring.
>
> Feb 03 16:25:02 memverge sbd[2473]: warning: inquisitor_child: pcmk
> health check: UNHEALTHY
>
> Feb 03 16:25:02 memverge sbd[2473]: warning: inquisitor_child: Servant
> pcmk is outdated (age: 1246)
>
> Feb 03 16:25:03 memverge sbd[2473]: notice: inquisitor_child: Servant
> pcmk is healthy (age: 0)
>
> Feb 05 15:01:05 memverge systemd[1]:
> /etc/systemd/system/sbd.service.d/override.conf:1: Assignment outside of
> section. Ignoring.
>
> [root at memverge ~]#
>
>
>
> Oh.., now it opened,
>
>
>
> Cluster Summary:
>
> * Stack: corosync (Pacemaker is running)
>
> * Current DC: memverge (27) (version 3.0.1-3.el10-b1a23a6) -
> MIXED-VERSION partition with quorum
>
> * Last updated: Thu Feb 5 15:14:45 2026
>
> * Last change: Thu Feb 5 15:12:09 2026 by root via root on memverge
>
> * 2 nodes configured
>
> * 23 resource instances configured
>
>
>
> Node List:
>
> * Node memverge (27): online, feature set 3.20.1
>
> * Node memverge2 (28): online, feature set <3.15.1
>
>
>
> Full List of Resources:
>
> * Resource Group: g-nfs:
>
> * pb_nfs (ocf:heartbeat:portblock): Started memverge
>
> * ip0_nfs (ocf:heartbeat:IPaddr2): Started memverge
>
> * fs_nfs_internal_info_HA (ocf:heartbeat:Filesystem): Started
> memverge
>
> * fs_nfsshare_exports_HA (ocf:heartbeat:Filesystem): Started
> memverge
>
> * nfsserver (ocf:heartbeat:nfsserver): Started memverge
>
> * expfs_nfsshare_exports_HA (ocf:heartbeat:exportfs): Started
> memverge
>
> * samba_service (systemd:smb): Started memverge
>
> * fs_sambashare_exports_HA (ocf:heartbeat:Filesystem): Started
> memverge
>
> * punb_nfs (ocf:heartbeat:portblock): Started memverge
>
> * Resource Group: g-iscsi:
>
> * pb_iscsi (ocf:heartbeat:portblock): Started memverge
>
> * ip0_iscsi (ocf:heartbeat:IPaddr2): Started memverge
>
> * ip1_iscsi (ocf:heartbeat:IPaddr2): Started memverge
>
> * iscsi_target (ocf:heartbeat:iSCSITarget): Started memverge
>
> * iscsi_lun_drbd3 (ocf:heartbeat:iSCSILogicalUnit): Started
> memverge
>
> * iscsi_lun_drbd4 (ocf:heartbeat:iSCSILogicalUnit): Started
> memverge
>
> * punb_iscsi (ocf:heartbeat:portblock): Started memverge
>
> * Clone Set: ha-nfs-clone [ha-nfs] (promotable):
>
> * ha-nfs (ocf:linbit:drbd): Unpromoted memverge2
>
> * ha-nfs (ocf:linbit:drbd): Promoted memverge
>
> * Clone Set: ha-iscsi-clone [ha-iscsi] (promotable):
>
> * ha-iscsi (ocf:linbit:drbd): Unpromoted memverge2
>
> * ha-iscsi (ocf:linbit:drbd): Promoted memverge
>
> * ipmi-fence-memverge (stonith:fence_ipmilan): Started memverge2
>
> * ipmi-fence-memverge2 (stonith:fence_ipmilan): Started
> memverge
>
> * watchdog-fencing (stonith:fence_watchdog): Starting memverge2
>
>
>
> Failed Resource Actions:
>
> * ipmi-fence-memverge_monitor_30000 on memverge2 'Error occurred' (1):
> call=93, status='Error', exitreason='Lost connection to fencer' *
> ipmi-fence-memveF
>
>
>
> And there are so many records in /var/log/messages,
>
>
>
> Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer
> connection failed (will retry): Transport endpoint is not connected
>
> Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer
> connection failed (will retry): Transport endpoint is not connected
>
> Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer
> connection failed (will retry): Transport endpoint is not connected
>
> Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer
> connection failed (will retry): Transport endpoint is not connected
>
> Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer
> connection failed (will retry): Transport endpoint is not connected
>
> Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer
> connection failed (will retry): Transport endpoint is not connected
>
> Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer
> connection failed (will retry): Transport endpoint is not connected
>
> Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer
> connection failed (will retry): Transport endpoint is not connected
>
> Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer
> connection failed (will retry): Transport endpoint is not connected
>
> Feb 5 15:13:10 memverge pacemaker-controld[755570]: notice: Fencer
> connection failed (will retry): Transport endpoint is not connected
>
> [root at memverge ~]#
>
>
>
> I’m new to pacemaker/corosync, so it is all quite complicated to me 😊
>
> Or maybe add fence_ipmilan as level 1 and don’t add sbd as level 2,
> assuming the cluster should automatically detect it just because
> have-watchdog=true and fall back to sbd even without an explicit level 2?
>
>
>
> Not sure what we're seeing. The 'Fencer connection failed ...' thing would
> point to pacemaker-fenced having had a segfault or something.
>
> You might see traces of that elsewhere. And it would explain strange
> behavior of pacemaker in general if it is constantly trying to restart
> pacemaker-fenced.
>
> But sorry, again I forgot to mention that the fence resource has to be
> called 'watchdog', otherwise pacemaker won't align it with the already
> existing internal hidden device (which exists if you have
> stonith-watchdog-timeout != 0).
>
> Using a different name is probably untested (I don't remember whether I
> tried that during development of the feature; it is definitely not a CI
> test case) and might lead to pacemaker-fenced running into an issue. That
> should probably be fixed, but if you use the correct naming it should work.
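>
> If it helps, a minimal sketch of the combination (the 10s value is only an
> example – stonith-watchdog-timeout is commonly set to roughly twice the SBD
> watchdog timeout):
>
>     pcs property set stonith-watchdog-timeout=10s
>     pcs stonith create watchdog fence_watchdog
>
> With stonith-watchdog-timeout != 0 the hidden internal device already
> exists; creating the resource under the name 'watchdog' just makes it
> visible, e.g. for use in a fencing topology.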
>
> Can you provide your CIB & corosync config so that we don't have to write
> back and forth so often?
>
>
>
> Regards,
>
> Klaus
>
>
>
> Anton
>
>
>
> *From:* Klaus Wenninger <kwenning at redhat.com>
> *Sent:* Thursday, February 5, 2026 2:52 PM
> *To:* Anton Gavriliuk <Anton.Gavriliuk at hpe.ua>
> *Cc:* Andrei Borzenkov <arvidjaar at gmail.com>; Cluster Labs - All topics
> related to open-source clustering welcomed <users at clusterlabs.org>
> *Subject:* Re: [ClusterLabs] Question about two level STONITH/fencing
>
>
>
>
>
>
>
> On Thu, Feb 5, 2026 at 12:56 PM Anton Gavriliuk <Anton.Gavriliuk at hpe.ua>
> wrote:
>
>
> Correct – in addition to the two cluster nodes, there is a dedicated 3rd
> node, a physical server, acting as the qdevice.
>
> I'm thinking about a two-level fencing topology: 1st level – fence_ipmilan,
> 2nd – diskless sbd (hpwdt, /dev/watchdog).
>
> But I can't add sbd as a 2nd level fencing,
>
> [root at memverge2 ~]# pcs stonith level add 2 memverge watchdog
> Error: Stonith resource(s) 'watchdog' do not exist, use --force to override
> Error: Errors have occurred, therefore pcs is unable to continue
> [root at memverge2 ~]#
>
> So back to the original question – what is the most correct way of
> implementing STONITH/fencing with fence_ipmilan + diskless sbd (hpwdt,
> /dev/watchdog)?
>
>
>
> Sorry then that I had overlooked qdevice (actually I thought I checked for
> it but ...).
>
> To add the watchdog to a fencing topology you first have to make it
> visible – just add it like any other fencing device, with fence_watchdog
> as the agent.
>
> There is a fence_watchdog script, but that is just for the metadata;
> pacemaker will recognize it and handle the actual fencing internally.
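>
> As a rough sketch of the end result (node and resource names are taken from
> your setup – adjust if the device-to-node mapping differs):
>
>     pcs stonith create watchdog fence_watchdog
>     pcs stonith level add 1 memverge ipmi-fence-memverge
>     pcs stonith level add 1 memverge2 ipmi-fence-memverge2
>     pcs stonith level add 2 memverge watchdog
>     pcs stonith level add 2 memverge2 watchdog
>
> Level 1 then tries IPMI first, and level 2 falls back to watchdog
> (self-)fencing if IPMI fails.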
>
>
>
> Regards,
>
> Klaus
>
>
>
>
> Anton
>
>
> -----Original Message-----
> From: Andrei Borzenkov <arvidjaar at gmail.com>
> Sent: Thursday, February 5, 2026 1:17 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed <
> users at clusterlabs.org>
> Cc: Anton Gavriliuk <Anton.Gavriliuk at hpe.ua>
> Subject: Re: [ClusterLabs] Question about two level STONITH/fencing
>
> On Thu, Feb 5, 2026 at 2:07 PM Klaus Wenninger <kwenning at redhat.com>
> wrote:
> >
> >
> >
> > On Wed, Feb 4, 2026 at 4:36 PM Anton Gavriliuk via Users <
> users at clusterlabs.org> wrote:
> >>
> >>
> >>
> >> Hello
> >>
> >>
> >>
> >> There is a two-node, shared-nothing, distributed active/standby pacemaker
> >> storage metro-cluster (HPE DL345 Gen12 servers) with DRBD-based
> >> synchronous (Protocol C) replication. It is configured with qdevice,
> >> heuristics (parallel fping) and fencing – fence_ipmilan and diskless sbd
> >> (hpwdt, /dev/watchdog). All cluster resources are configured to always
> >> run together on the same node.
> >>
> >>
> >>
> >> The two storage cluster nodes and the qdevice are running on Rocky Linux 10.1
> >>
> >> Pacemaker version 3.0.1
> >>
> >> Corosync version 3.1.9
> >>
> >> DRBD version 9.3.0
> >>
> >>
> >>
> >> So, the question is – what is the most correct way of implementing
> >> STONITH/fencing with fence_ipmilan + diskless sbd (hpwdt, /dev/watchdog)?
> >
> >
> > The correct way of using diskless sbd with a two-node cluster is not
> > to use it ;-)
> >
> > Diskless sbd (watchdog fencing) requires 'real' quorum, and the quorum
> > provided by corosync in two-node mode would introduce split-brain, which
> > is why sbd recognizes two-node operation and replaces corosync quorum
> > with the information that the peer node is currently in the cluster.
> > This is fine for poison-pill fencing – a single shared disk then doesn't
> > become a single point of failure as long as the peer is there. But for
> > watchdog fencing that doesn't help, because the peer going away would
> > mean you have to commit suicide.
> >
> > An alternative with a two-node cluster is to step away from the actual
> > two-node design and go with qdevice for 'real' quorum.
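> >
> > Just to illustrate that variant (the host name below is an assumption –
> > adapt it to your environment), the quorum section of corosync.conf would
> > then look roughly like:
> >
> >     quorum {
> >         provider: corosync_votequorum
> >         device {
> >             model: net
> >             net {
> >                 host: qdevice-host.example.com
> >                 algorithm: ffsplit
> >             }
> >         }
> >     }
> >
> > with the two_node flag removed, so that sbd sees 'real' quorum again.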
>
> Hmm ... the original description does mention qdevice, although it is not
> quite clear where it is located (is there a third node?).
>
> > You'll need some kind of 3rd node but it doesn't have to be a full
> cluster node.
> >
>
>