[ClusterLabs] SBD fencing not working on my two-node cluster

Philippe M Stedman pmstedma at us.ibm.com
Mon Sep 21 19:06:04 EDT 2020


Hi Strahil,

Here is the output of those commands.... I appreciate the help!

# crm config show
node 1: ceha03 \
        attributes ethmonitor-ens192=1
node 2: ceha04 \
        attributes ethmonitor-ens192=1
(...)
primitive stonith_sbd stonith:fence_sbd \
        params devices="/dev/sde1" \
        meta is-managed=true
(...)
property cib-bootstrap-options: \
        have-watchdog=true \
        dc-version=2.0.2-1.el8-744a30d655 \
        cluster-infrastructure=corosync \
        cluster-name=ps_dom \
        stonith-enabled=true \
        no-quorum-policy=ignore \
        stop-all-resources=false \
        cluster-recheck-interval=60 \
        symmetric-cluster=true \
        stonith-watchdog-timeout=0
rsc_defaults rsc-options: \
        is-managed=false \
        resource-stickiness=0 \
        failure-timeout=1min
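
(Aside: a minimal sketch of how the stonith primitive could be told explicitly
which nodes it may fence; pcmk_host_list and the by-id device path here are
illustrative, not taken from the configuration above, and whether this helps
depends on why the fencer reports no device:)

primitive stonith_sbd stonith:fence_sbd \
        params devices="/dev/disk/by-id/scsi-<long_uuid>" \
               pcmk_host_list="ceha03 ceha04" \
        meta is-managed=true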

# cat /etc/sysconfig/sbd
SBD_DEVICE="/dev/sde1"
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_DELAY_START=no
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5
SBD_TIMEOUT_ACTION=flush,reboot
SBD_MOVE_TO_ROOT_CGROUP=auto
SBD_OPTS=
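
(Aside: a quick sanity check that the configured watchdog device is really
usable; a sketch assuming the softdog module and the sbd query-watchdog
subcommand are available on this distro:)

# lsmod | grep softdog      # is the software watchdog module loaded?
# sbd query-watchdog        # which watchdog devices can sbd see?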

# systemctl status sbd
 sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-09-21 18:36:28 EDT; 15min ago
     Docs: man:sbd(8)
  Process: 12810 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=exited, status=0/SUCCESS)
 Main PID: 12812 (sbd)
    Tasks: 4 (limit: 26213)
   Memory: 14.5M
   CGroup: /system.slice/sbd.service
           ├─12812 sbd: inquisitor
           ├─12814 sbd: watcher: /dev/sde1 - slot: 0 - uuid: 94d67f15-e301-4fa9-89ae-e3ce2e82c9e7
           ├─12815 sbd: watcher: Pacemaker
           └─12816 sbd: watcher: Cluster

Sep 21 18:36:27 ceha03.canlab.ibm.com systemd[1]: Starting Shared-storage based fencing daemon...
Sep 21 18:36:27 ceha03.canlab.ibm.com sbd[12810]: notice: main: Doing flush + writing 'b' to sysrq on timeout
Sep 21 18:36:27 ceha03.canlab.ibm.com sbd[12815]: pcmk:   notice: servant_pcmk: Monitoring Pacemaker health
Sep 21 18:36:27 ceha03.canlab.ibm.com sbd[12816]: cluster:   notice: servant_cluster: Monitoring unknown cluster health
Sep 21 18:36:27 ceha03.canlab.ibm.com sbd[12814]: /dev/sde1:   notice: servant_md: Monitoring slot 0 on disk /dev/sde1
Sep 21 18:36:28 ceha03.canlab.ibm.com sbd[12812]: notice: watchdog_init: Using watchdog device '/dev/watchdog'
Sep 21 18:36:28 ceha03.canlab.ibm.com sbd[12816]: cluster:   notice: sbd_get_two_node: Corosync is in 2Node-mode
Sep 21 18:36:28 ceha03.canlab.ibm.com sbd[12812]: notice: inquisitor_child: Servant cluster is healthy (age: 0)
Sep 21 18:36:28 ceha03.canlab.ibm.com systemd[1]: Started Shared-storage based fencing daemon.

# sbd -d /dev/disk/by-id/scsi-<long_uuid> dump
[root@ceha03 by-id]# sbd -d /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1 dump
==Dumping header on disk /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1
Header version     : 2.1
UUID               : 94d67f15-e301-4fa9-89ae-e3ce2e82c9e7
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 5
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 10
==Header on disk /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1 is dumped
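
(Aside: a sketch of two further checks against the same device, per sbd(8);
listing the node slots and sending a test message to a peer, where the node
name is the one from this cluster:)

# sbd -d /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1 list
# sbd -d /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1 message ceha04 test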


Thanks,

Phil Stedman
Db2 High Availability Development and Support
Email: pmstedma at us.ibm.com



From:	Strahil Nikolov <hunter86_bg at yahoo.com>
To:	"users at clusterlabs.org" <users at clusterlabs.org>
Date:	09/21/2020 01:41 PM
Subject:	[EXTERNAL] Re: [ClusterLabs] SBD fencing not working on my
            two-node cluster
Sent by:	"Users" <users-bounces at clusterlabs.org>



Can you provide the following (with sensitive data replaced):

crm configure show
cat /etc/sysconfig/sbd
systemctl status sbd
sbd -d /dev/disk/by-id/scsi-<long_uuid> dump

P.S.: It is very bad practice to use "/dev/sdXYZ", as these names are not
persistent. Always use persistent names like those under
"/dev/disk/by-XYZ/ZZZZ". Also, SBD needs at most a 10 MB block device, and
yours seems unnecessarily big.
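
(A minimal sketch of how to look up the persistent by-id name for /dev/sde1;
the exact symlink name will differ per system:)

# ls -l /dev/disk/by-id/ | grep sde1
# udevadm info --query=symlink --name=/dev/sde1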


Most probably /dev/sde1 is your problem.

Best Regards,
Strahil Nikolov




On Monday, September 21, 2020 at 23:19:47 GMT+3, Philippe M Stedman
<pmstedma at us.ibm.com> wrote:





Hi,

I have been following the instructions on the following page to try and
configure SBD fencing on my two-node cluster:
https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-storage-protect.html


I am able to get through all the steps successfully, I am using the
following device (/dev/sde1) as my shared disk:

Disk /dev/sde: 20 GiB, 21474836480 bytes, 41943040 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 43987868-1C0B-41CE-8AF8-C522AB259655

Device     Start      End  Sectors Size Type
/dev/sde1     48 41942991 41942944  20G IBM General Parallel Fs
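
(For reference, a sketch of how the device header would be initialized and
verified following the SUSE guide above; the timeout values are the ones that
appear in the sbd dump output earlier in this thread:)

# sbd -d /dev/sde1 -1 5 -4 10 create    # -1 = watchdog timeout, -4 = msgwait
# sbd -d /dev/sde1 dump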

Since I don't have a hardware watchdog at my disposal, I am using the
software watchdog (softdog) instead. With softdog in place, I can still get
through all the steps successfully: I create the fence agent resource, and it
shows as Started in the crm status output:

stonith_sbd (stonith:fence_sbd): Started ceha04
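
(Aside: a sketch of making the software watchdog persistent across reboots,
assuming a systemd-based system; the file name is arbitrary:)

# echo softdog > /etc/modules-load.d/watchdog.conf
# modprobe softdog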

The problem arises when I run crm node fence ceha04 to test fencing a host in
my cluster. The crm status output shows that the reboot action has failed,
and in the system logs I see the following messages:

Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Requesting fencing (reboot) of node ceha04
Sep 21 14:12:33 ceha04 pacemaker-fenced[24142]: notice: Client pacemaker-controld.24146.5ff1ac0c wants to fence (reboot) 'ceha04' with device '(any)'
Sep 21 14:12:33 ceha04 pacemaker-fenced[24142]: notice: Requesting peer fencing (reboot) of ceha04
Sep 21 14:12:33 ceha04 pacemaker-fenced[24142]: notice: Couldn't find anyone to fence (reboot) ceha04 with any device
Sep 21 14:12:33 ceha04 pacemaker-fenced[24142]: error: Operation reboot of ceha04 by <no-one> for pacemaker-controld.24146@ceha04.1bad3987: No such device
Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Stonith operation 3/1:4317:0:ec560474-96ea-4984-b801-400d11b5b3ae: No such device (-19)
Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Stonith operation 3 for ceha04 failed (No such device): aborting transition.
Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: warning: No devices found in cluster to fence ceha04, giving up
Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Transition 4317 aborted: Stonith failed
Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Peer ceha04 was not terminated (reboot) by <anyone> on behalf of pacemaker-controld.24146: No such device

I don't know why Pacemaker isn't able to discover my fencing resource. Why
can't it find anyone in the cluster to fence the host?
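
(A sketch of commands that might show what the fencer has actually registered;
stonith_admin is part of Pacemaker:)

# stonith_admin -L            # devices registered with the local fencer
# stonith_admin -l ceha04     # devices able to fence ceha04
# stonith_admin -H ceha04     # fencing history for ceha04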

Any help is greatly appreciated. I can provide more details as required.

Thanks,

Phil Stedman
Db2 High Availability Development and Support
Email: pmstedma at us.ibm.com


_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users


ClusterLabs home:
https://www.clusterlabs.org/



