[ClusterLabs] Antw: [EXT] Re: SBD fencing not working on my two-node cluster

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue Sep 29 07:50:37 EDT 2020


>>> Strahil Nikolov <hunter86_bg at yahoo.com> wrote on 22.09.2020 at 07:23 in
message <1814286403.4657404.1600752191237 at mail.yahoo.com>:
> Replace /dev/sde1 with 
> /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1 :
> - in /etc/sysconfig/sbd
> - in the cib (via crm configure edit)
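For example (just a sketch; the resource name stonith_sbd and the by-id path
are taken from the output further down, and "crm configure edit stonith_sbd"
opens that one object in an editor):

  /etc/sysconfig/sbd:
    SBD_DEVICE="/dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1"

  CIB:
    # crm configure edit stonith_sbd
      params devices="/dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1"

Note that sbd only picks up a changed SBD_DEVICE when the sbd service is
restarted, which in practice means restarting the cluster stack on that node.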
> 
> Also, I don't see 'stonith-enabled=true' which could be your actual problem.
> 
> I think you can set it via :
> crm configure property stonith-enabled=true
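If in doubt, the current value can be queried directly, e.g.:

  # crm_attribute --type crm_config --name stonith-enabled --query

(It does show up as stonith-enabled=true in the configuration quoted below.)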
> 
> P.S.: Consider setting the 'resource-stickiness' to '1'. Using partitions
> is not the best option but is better than nothing.
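If you follow that suggestion, the cluster-wide default can be set in one
step (it is the rsc_defaults value that currently reads resource-stickiness=0
further down):

  # crm configure rsc_defaults resource-stickiness=1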

I think partitions are fine, especially when you have modern SAN storage
where the smallest allocatable amount is 1 GB.
The other thing I'd recommend is pre-allocating the message slots for each
cluster node (example below).
Most importantly, /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1
is probably the device to specify.
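Pre-allocating is one sbd call per node, e.g. (node names as in the crm
configuration below):

  # sbd -d /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1 allocate ceha03
  # sbd -d /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1 allocate ceha04
  # sbd -d /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1 list

"list" shows the allocated slots, so you can verify the result.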

Regards,
Ulrich

> 
> Best Regards,
> Strahil Nikolov
> 
> 
> On Tuesday, 22 September 2020 at 02:06:10 GMT+3, Philippe M Stedman 
> <pmstedma at us.ibm.com> wrote: 
> 
> Hi Strahil,
> 
> Here is the output of those commands.... I appreciate the help!
> 
> # crm config show
> node 1: ceha03 \
>         attributes ethmonitor-ens192=1
> node 2: ceha04 \
>         attributes ethmonitor-ens192=1
> (...)
> primitive stonith_sbd stonith:fence_sbd \
>         params devices="/dev/sde1" \
>         meta is-managed=true
> (...)
> property cib-bootstrap-options: \
>         have-watchdog=true \
>         dc-version=2.0.2-1.el8-744a30d655 \
>         cluster-infrastructure=corosync \
>         cluster-name=ps_dom \
>         stonith-enabled=true \
>         no-quorum-policy=ignore \
>         stop-all-resources=false \
>         cluster-recheck-interval=60 \
>         symmetric-cluster=true \
>         stonith-watchdog-timeout=0
> rsc_defaults rsc-options: \
>         is-managed=false \
>         resource-stickiness=0 \
>         failure-timeout=1min
> 
> # cat /etc/sysconfig/sbd
> SBD_DEVICE="/dev/sde1"
> SBD_PACEMAKER=yes
> SBD_STARTMODE=always
> SBD_DELAY_START=no
> SBD_WATCHDOG_DEV=/dev/watchdog
> SBD_WATCHDOG_TIMEOUT=5
> SBD_TIMEOUT_ACTION=flush,reboot
> SBD_MOVE_TO_ROOT_CGROUP=auto
> SBD_OPTS=
> 
> # systemctl status sbd
> sbd.service - Shared-storage based fencing daemon
>    Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
>    Active: active (running) since Mon 2020-09-21 18:36:28 EDT; 15min ago
>      Docs: man:sbd(8)
>   Process: 12810 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=exited, status=0/SUCCESS)
>  Main PID: 12812 (sbd)
>     Tasks: 4 (limit: 26213)
>    Memory: 14.5M
>    CGroup: /system.slice/sbd.service
>            ├─12812 sbd: inquisitor
>            ├─12814 sbd: watcher: /dev/sde1 - slot: 0 - uuid: 94d67f15-e301-4fa9-89ae-e3ce2e82c9e7
>            ├─12815 sbd: watcher: Pacemaker
>            └─12816 sbd: watcher: Cluster
> 
> Sep 21 18:36:27 ceha03.canlab.ibm.com systemd[1]: Starting Shared-storage based fencing daemon...
> Sep 21 18:36:27 ceha03.canlab.ibm.com sbd[12810]: notice: main: Doing flush + writing 'b' to sysrq on timeout
> Sep 21 18:36:27 ceha03.canlab.ibm.com sbd[12815]: pcmk: notice: servant_pcmk: Monitoring Pacemaker health
> Sep 21 18:36:27 ceha03.canlab.ibm.com sbd[12816]: cluster: notice: servant_cluster: Monitoring unknown cluster health
> Sep 21 18:36:27 ceha03.canlab.ibm.com sbd[12814]: /dev/sde1: notice: servant_md: Monitoring slot 0 on disk /dev/sde1
> Sep 21 18:36:28 ceha03.canlab.ibm.com sbd[12812]: notice: watchdog_init: Using watchdog device '/dev/watchdog'
> Sep 21 18:36:28 ceha03.canlab.ibm.com sbd[12816]: cluster: notice: sbd_get_two_node: Corosync is in 2Node-mode
> Sep 21 18:36:28 ceha03.canlab.ibm.com sbd[12812]: notice: inquisitor_child: Servant cluster is healthy (age: 0)
> Sep 21 18:36:28 ceha03.canlab.ibm.com systemd[1]: Started Shared-storage based fencing daemon.
> 
> # sbd -d /dev/disk/by-id/scsi-<long_uuid> dump
> [root at ceha03 by-id]# sbd -d /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1 dump
> ==Dumping header on disk /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1
> Header version     : 2.1
> UUID               : 94d67f15-e301-4fa9-89ae-e3ce2e82c9e7
> Number of slots    : 255
> Sector size        : 512
> Timeout (watchdog) : 5
> Timeout (allocate) : 2
> Timeout (loop)     : 1
> Timeout (msgwait)  : 10
> ==Header on disk /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1 is dumped
> 
> 
> Thanks,
> 
> Phil Stedman
> Db2 High Availability Development and Support
> Email: pmstedma at us.ibm.com 
> 
> Strahil Nikolov ---09/21/2020 01:41:10 PM---Can you provide (replace 
> sensitive data) : crm configure show
> 
> From: Strahil Nikolov <hunter86_bg at yahoo.com>
> To: "users at clusterlabs.org" <users at clusterlabs.org>
> Date: 09/21/2020 01:41 PM
> Subject: [EXTERNAL] Re: [ClusterLabs] SBD fencing not working on my two-node cluster
> Sent by: "Users" <users-bounces at clusterlabs.org>
> ________________________________
> 
> 
> 
> Can you provide (replace sensitive data) :
> 
> crm configure show
> cat /etc/sysconfig/sbd
> systemctl status sbd
> sbd -d /dev/disk/by-id/scsi-<long_uuid> dump
> 
> P.S.: It is very bad practice to use "/dev/sdXYZ" as these names are not 
> permanent. Always use persistent names like those under 
> "/dev/disk/by-XYZ/ZZZZ". Also, SBD needs at most a 10 MB block device and 
> yours seems unnecessarily big.
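For reference, the matching persistent name for an existing partition can be
looked up instead of guessed, e.g. with either of:

  # ls -l /dev/disk/by-id/ | grep sde1
  # udevadm info --query=symlink --name=/dev/sde1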
> 
> 
> Most probably /dev/sde1 is your problem. 
> 
> Best Regards,
> Strahil Nikolov
> 
> On Monday, 21 September 2020 at 23:19:47 GMT+3, Philippe M Stedman 
> <pmstedma at us.ibm.com> wrote: 
> 
> Hi,
> 
> I have been following the instructions on the following page to try and 
> configure SBD fencing on my two-node cluster:
> https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-storage-protect.html 
> 
> I am able to get through all the steps successfully. I am using the 
> following device (/dev/sde1) as my shared disk:
> 
> Disk /dev/sde: 20 GiB, 21474836480 bytes, 41943040 sectors
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disklabel type: gpt
> Disk identifier: 43987868-1C0B-41CE-8AF8-C522AB259655
> 
> Device     Start      End  Sectors Size Type
> /dev/sde1     48 41942991 41942944  20G IBM General Parallel Fs
> 
> Since I don't have a hardware watchdog at my disposal, I am using the 
> software watchdog (softdog) instead. Even so, I get through all the steps 
> successfully: I create the fence agent resource, and it shows as Started in 
> the crm status output:
> 
> stonith_sbd (stonith:fence_sbd): Started ceha04
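A side note on softdog: it is worth checking on both nodes that the module is
actually loaded and will be loaded again after a reboot, roughly:

  # lsmod | grep softdog
  # ls -l /dev/watchdog
  # echo softdog > /etc/modules-load.d/watchdog.conf    (file name is just an example)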
> 
> The problem is when I run crm node fence ceha04 to test out fencing a host 
> in my cluster. From the crm status output, I see that the reboot action has 
> failed and furthermore, in the system logs, I see the following messages:
> 
> Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Requesting fencing (reboot) of node ceha04
> Sep 21 14:12:33 ceha04 pacemaker-fenced[24142]: notice: Client pacemaker-controld.24146.5ff1ac0c wants to fence (reboot) 'ceha04' with device '(any)'
> Sep 21 14:12:33 ceha04 pacemaker-fenced[24142]: notice: Requesting peer fencing (reboot) of ceha04
> Sep 21 14:12:33 ceha04 pacemaker-fenced[24142]: notice: Couldn't find anyone to fence (reboot) ceha04 with any device
> Sep 21 14:12:33 ceha04 pacemaker-fenced[24142]: error: Operation reboot of ceha04 by <no-one> for pacemaker-controld.24146 at ceha04.1bad3987: No such device
> Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Stonith operation 3/1:4317:0:ec560474-96ea-4984-b801-400d11b5b3ae: No such device (-19)
> Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Stonith operation 3 for ceha04 failed (No such device): aborting transition.
> Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: warning: No devices found in cluster to fence ceha04, giving up
> Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Transition 4317 aborted: Stonith failed
> Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Peer ceha04 was not terminated (reboot) by <anyone> on behalf of pacemaker-controld.24146: No such device
> 
> I don't know why Pacemaker isn't able to discover my fencing resource, or why 
> it isn't able to find anyone to fence the host from the cluster.
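Two read-only checks usually narrow this down: whether the fencer on each node
actually has the device registered, and whether sbd itself can read the slots
on the disk, e.g.:

  # stonith_admin --list-registered
  # sbd -d /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1 list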
> 
> Any help is greatly appreciated. I can provide more details as required.
> 
> Thanks,
> 
> Phil Stedman
> Db2 High Availability Development and Support
> Email: pmstedma at us.ibm.com 
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 
> 




