<div dir="ltr"><div>Hi,<br><br>I'm trying to get familiar with STONITH Block Devices (SBD) on a 3-node CentOS 7 cluster built in VirtualBox.<br>The complete setup is available at <a href="https://github.com/marcindulak/vagrant-sbd-tutorial-centos7.git">https://github.com/marcindulak/vagrant-sbd-tutorial-centos7.git</a><br>so hopefully with some help I'll be able to make it work.<br><br>Question 1:<br>The shared device /dev/sdb1 is a VirtualBox "shareable hard disk" (<a href="https://www.virtualbox.org/manual/ch05.html#hdimagewrites">https://www.virtualbox.org/manual/ch05.html#hdimagewrites</a>).<br>Will SBD fencing work with that type of storage?<br><br>I start the cluster using vagrant_1.8.1 and virtualbox-4.3 with:<br>$ vagrant up  # takes ~15 minutes<br><br>The setup brings up the nodes, installs the necessary packages, and prepares for the configuration of the pcs cluster.<br>You can see which scripts the nodes execute at the bottom of the Vagrantfile.<br>While sbd itself can be installed with 'yum -y install sbd' on CentOS 7, the fence_sbd agent has not been packaged yet.<br>Therefore I rebuild the Fedora 24 package using the latest <a href="https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz">https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz</a><br>plus the update to fence_sbd from <a href="https://github.com/ClusterLabs/fence-agents/pull/73">https://github.com/ClusterLabs/fence-agents/pull/73</a><br><br>The configuration is inspired by <a href="https://www.novell.com/support/kb/doc.php?id=7009485">https://www.novell.com/support/kb/doc.php?id=7009485</a> and<br><a href="https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html">https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html</a><br><br>Question 2:<br>After reading <a href="http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit">http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit</a> I expect that with just one stonith 
resource configured<br>a node will be fenced when I stop pacemaker and corosync with `pcs cluster stop node-1` or just run `stonith_admin -F node-1`, but this is not the case.<br><br>As can be seen below from uptime, node-1 is not shut down by `pcs cluster stop node-1` executed on itself.<br>I found some discussions on <a href="mailto:users@clusterlabs.org">users@clusterlabs.org</a> about whether a node running an SBD resource can fence itself,<br>but the conclusion was not clear to me.<br><br>Question 3:<br>Nor is node-1 fenced by `stonith_admin -F node-1` executed on node-2, despite<br>/var/log/messages on node-2 (the one currently running MyStonith) reporting:<br>...<br>notice: Operation 'off' [3309] (call 2 from stonith_admin.3288) for host 'node-1' with device 'MyStonith' returned: 0 (OK)

<br>...<br>What is happening here?<br><br>Question 4 (for the future):<br>Assuming node-1 had been fenced, how is SBD operated afterwards?<br>sbd now lists:<br>0       node-3  clear<br>1       node-1  off    node-2<br>2       node-2  clear<br>How do I clear the status of node-1?<br><br>Question 5 (also for the future):<br>While the relation 'stonith-timeout = Timeout (msgwait) + 20%' presented<br>at <a href="https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_storage_protect_fencing.html">https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_storage_protect_fencing.html</a><br>is clearly described, I wonder how 'stonith-timeout' relates<br>to other timeouts like the 'monitor interval=60s' reported by `pcs stonith show MyStonith`.<br><br>Here is how I configure the cluster and test it. The run.sh script is attached.<br><br>$ sh -x run01.sh 2>&1 | tee run01.txt<br><br>with the result:<br><br>$ cat run01.txt<br><br></div>Each block below shows the executed ssh command and its result.<br><div><br>############################<br>ssh node-1 -c sudo su - -c 'pcs cluster auth -u hacluster -p password node-1 node-2 node-3'<br>node-1: Authorized

<br>node-3: Authorized

<br>node-2: Authorized

<br><br><br><br>############################<br>ssh node-1 -c sudo su - -c 'pcs cluster setup --name mycluster node-1 node-2 node-3'<br>Shutting down pacemaker/corosync services...

<br>Redirecting to /bin/systemctl stop  pacemaker.service

<br>Redirecting to /bin/systemctl stop  corosync.service

<br>Killing any remaining services...

<br>Removing all cluster configuration files...

<br>node-1: Succeeded

<br>node-2: Succeeded

<br>node-3: Succeeded

<br>Synchronizing pcsd certificates on nodes node-1, node-2, node-3...

<br>node-1: Success

<br>node-3: Success

<br>node-2: Success 

<br>Restaring pcsd on the nodes in order to reload the certificates...

<br>node-1: Success

<br>node-3: Success

<br>node-2: Success

<br><br><br><br>############################<br>ssh node-1 -c sudo su - -c 'pcs cluster start --all'<br>node-3: Starting Cluster...

<br>node-2: Starting Cluster...

<br>node-1: Starting Cluster...

<br><br><br><br>############################<br>ssh node-1 -c sudo su - -c 'corosync-cfgtool -s'<br>Printing ring status.

<br>Local node ID 1

<br>RING ID 0

<br>    id    = 192.168.10.11

<br>    status    = ring 0 active with no faults

<br><br><br>############################<br>ssh node-1 -c sudo su - -c 'pcs status corosync' <br>Membership information

<br>----------------------

<br>    Nodeid      Votes Name

<br>         1          1 node-1 (local)

<br>         2          1 node-2

<br>         3          1 node-3

<br><br><br><br>############################<br>ssh node-1 -c sudo su - -c 'pcs status'<br>Cluster name: mycluster

<br>WARNING: no stonith devices and stonith-enabled is not false

<br>Last updated: Sat Jun 25 15:40:51 2016        Last change: Sat Jun 25 15:40:33 2016 by hacluster via crmd on node-2

<br>Stack: corosync

<br>Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum

<br>3 nodes and 0 resources configured

<br>Online: [ node-1 node-2 node-3 ]

<br>Full list of resources:

<br>PCSD Status:

<br>  node-1: Online

<br>  node-2: Online

<br>  node-3: Online

<br>Daemon Status:

<br>  corosync: active/disabled

<br>  pacemaker: active/disabled

<br>  pcsd: active/enabled

<br><br><br><br><br>############################<br>ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'<br>0    node-3    clear     <br>1    node-2    clear     <br>2    node-1    clear     <br><br><br><br><br>############################<br>ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 dump'<br>==Dumping header on disk /dev/sdb1

<br>Header version     : 2.1

<br>UUID               : 79f28167-a207-4f2a-a723-aa1c00bf1dee

<br>Number of slots    : 255

<br>Sector size        : 512

<br>Timeout (watchdog) : 10

<br>Timeout (allocate) : 2

<br>Timeout (loop)     : 1

<br>Timeout (msgwait)  : 20

<br>==Header on disk /dev/sdb1 is dumped

<br><br><br><br><br>############################<br>ssh node-1 -c sudo su - -c 'pcs stonith list'<br>fence_sbd - Fence agent for sbd

<br><br><br><br><br>############################<br>ssh node-1 -c sudo su - -c 'pcs stonith create MyStonith fence_sbd devices=/dev/sdb1 power_timeout=21 action=off'<br>ssh node-1 -c sudo su - -c 'pcs property set stonith-enabled=true'<br>ssh node-1 -c sudo su - -c 'pcs property set stonith-timeout=24s'<br>ssh node-1 -c sudo su - -c 'pcs property'<br>Cluster Properties:

<br> cluster-infrastructure: corosync

<br> cluster-name: mycluster

<br> dc-version: 1.1.13-10.el7_2.2-44eb2dd

<br> have-watchdog: true

<br> stonith-enabled: true

<br> stonith-timeout: 24s

<br> stonith-watchdog-timeout: 10s
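<br><br>For reference, the stonith-timeout of 24s set above follows the 'stonith-timeout = Timeout (msgwait) + 20%' relation from Question 5, applied to the msgwait of 20 shown in the sbd dump. Below is a minimal sketch of that arithmetic. Both function names are hypothetical helpers of mine, not part of sbd or pcs, and the parser assumes the plain-text layout of the `sbd dump` output shown earlier:<br>

```shell
#!/bin/sh
# Hypothetical helpers (not part of sbd or pcs), for illustration only.

# Print "msgwait + 20%" in whole seconds, rounded up so the result
# is never smaller than msgwait * 1.2.
stonith_timeout_from_msgwait() {
    msgwait=$1
    echo $(( (msgwait * 12 + 9) / 10 ))
}

# Read `sbd dump`-style text on stdin and print the msgwait value,
# i.e. the number after the colon on the "Timeout (msgwait)" line.
msgwait_from_dump() {
    awk -F: '/Timeout \(msgwait\)/ { gsub(/ /, "", $2); print $2 }'
}
```

<br>On a node this could be combined as `stonith_timeout_from_msgwait "$(sbd -d /dev/sdb1 dump | msgwait_from_dump)"`, which I would expect to print 24 for the header above. It says nothing about how the monitor interval should relate to stonith-timeout, which is still my open question.<br>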

<br><br><br><br>############################<br>ssh node-1 -c sudo su - -c 'pcs stonith show MyStonith'<br> Resource: MyStonith (class=stonith type=fence_sbd)

<br>  Attributes: devices=/dev/sdb1 power_timeout=21 action=off 

<br>  Operations: monitor interval=60s (MyStonith-monitor-interval-60s)

<br><br><br><br>############################<br>ssh node-1 -c sudo su - -c 'pcs cluster stop node-1 '<br>node-1: Stopping Cluster (pacemaker)...

<br>node-1: Stopping Cluster (corosync)... 

<br><br><br><br>############################<br>ssh node-2 -c sudo su - -c 'pcs status'<br>Cluster name: mycluster

<br>Last updated: Sat Jun 25 15:42:29 2016        Last change: Sat Jun 25 15:41:09 2016 by root via cibadmin on node-1

<br>Stack: corosync

<br>Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum

<br>3 nodes and 1 resource configured 

<br>Online: [ node-2 node-3 ]

<br>OFFLINE: [ node-1 ]

<br>Full list of resources: 

<br> MyStonith    (stonith:fence_sbd):    Started node-2

<br>PCSD Status:

<br>  node-1: Online

<br>  node-2: Online

<br>  node-3: Online 

<br>Daemon Status:

<br>  corosync: active/disabled

<br>  pacemaker: active/disabled

<br>  pcsd: active/enabled

<br><br><br><br>############################<br>ssh node-2 -c sudo su - -c 'stonith_admin -F node-1 '<br><br><br><br>############################<br>ssh node-2 -c sudo su - -c 'grep stonith-ng /var/log/messages'<br>Jun 25 15:40:11 localhost stonith-ng[3102]:  notice: Additional logging available in /var/log/cluster/corosync.log

<br>Jun 25 15:40:11 localhost stonith-ng[3102]:  notice: Connecting to cluster infrastructure: corosync

<br>Jun 25 15:40:11 localhost stonith-ng[3102]:  notice: crm_update_peer_proc: Node node-2[2] - state is now member (was (null))

<br>Jun 25 15:40:12 localhost stonith-ng[3102]:  notice: Watching for stonith topology changes

<br>Jun 25 15:40:12 localhost stonith-ng[3102]:  notice: Added 'watchdog' to the device list (1 active devices)

<br>Jun 25 15:40:12 localhost stonith-ng[3102]:  notice: crm_update_peer_proc: Node node-3[3] - state is now member (was (null))

<br>Jun 25 15:40:12 localhost stonith-ng[3102]:  notice: crm_update_peer_proc: Node node-1[1] - state is now member (was (null))

<br>Jun 25 15:40:12 localhost stonith-ng[3102]:  notice: New watchdog timeout 10s (was 0s)

<br>Jun 25 15:41:03 localhost stonith-ng[3102]:  notice: Relying on watchdog integration for fencing

<br>Jun 25 15:41:04 localhost stonith-ng[3102]:  notice: Added 'MyStonith' to the device list (2 active devices)

<br>Jun 25 15:41:54 localhost stonith-ng[3102]:  notice: crm_update_peer_proc: Node node-1[1] - state is now lost (was member)

<br>Jun 25 15:41:54 localhost stonith-ng[3102]:  notice: Removing node-1/1 from the membership list

<br>Jun 25 15:41:54 localhost stonith-ng[3102]:  notice: Purged 1 peers with id=1 and/or uname=node-1 from the membership cache

<br>Jun 25 15:42:33 localhost stonith-ng[3102]:  notice: Client stonith_admin.3288.eb400ac9 wants to fence (off) 'node-1' with device '(any)'

<br>Jun 25 15:42:33 localhost stonith-ng[3102]:  notice: Initiating remote operation off for node-1: 848cd1e9-55e4-4abc-8d7a-3762eaaf9ab4 (0)

<br>Jun 25 15:42:33 localhost stonith-ng[3102]:  notice: watchdog can not fence (off) node-1: static-list

<br>Jun 25 15:42:33 localhost stonith-ng[3102]:  notice: MyStonith can fence (off) node-1: dynamic-list

<br>Jun 25 15:42:33 localhost stonith-ng[3102]:  notice: watchdog can not fence (off) node-1: static-list

<br>Jun 25 15:42:54 localhost stonith-ng[3102]:  notice: Operation 'off' [3309] (call 2 from stonith_admin.3288) for host 'node-1' with device 'MyStonith' returned: 0 (OK)

<br>Jun 25 15:42:54 localhost stonith-ng[3102]:  notice: Operation off of node-1 by node-2 for stonith_admin.3288@node-2.848cd1e9: OK

<br>Jun 25 15:42:54 localhost stonith-ng[3102]: warning: new_event_notification (3102-3288-12): Broken pipe (32)

<br>Jun 25 15:42:54 localhost stonith-ng[3102]: warning: st_notify_fence notification of client stonith_admin.3288.eb400a failed: Broken pipe (-32)

<br><br><br><br>############################<br>ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'<br>0    node-3    clear     <br>1    node-2    clear     <br>2    node-1    off    node-2

<br><br><br><br>############################<br>ssh node-1 -c sudo su - -c 'uptime'<br> 15:43:31 up 21 min,  2 users,  load average: 0.25, 0.18, 0.11
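<br><br>The last two checks boil down to: the slot for node-1 holds an 'off' message, yet the node is still up. To make the first half of that easier to spot in repeated test runs, here is a small sketch of a filter over `sbd list` output. The function name is a hypothetical helper of mine, and it assumes the whitespace-separated "slot, node, message, sender" layout shown above:<br>

```shell
#!/bin/sh
# Hypothetical helper (not part of sbd): read `sbd list`-style text on
# stdin and print "node: message" for every slot whose message is not "clear".
pending_sbd_messages() {
    awk 'NF >= 3 && $3 != "clear" { print $2 ": " $3 }'
}
```

<br>For example, `sbd -d /dev/sdb1 list | pending_sbd_messages` run against the listing above should report only node-1 with its 'off' message.<br>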

<br><br><br><br></div><div>Cheers,<br><br></div><div>Marcin<br></div><div><br></div></div>