[ClusterLabs] problems with a CentOS7 SBD cluster

Tue Jun 28 03:04:32 UTC 2016

On Sun, Jun 26, 2016 at 6:05 AM, Marcin Dulak <marcin.dulak at gmail.com> wrote:
> Hi,
>
> I'm trying to get familiar with STONITH Block Devices (SBD) on a 3-node
> CentOS7 built in VirtualBox.
> The complete setup is available at
> https://github.com/marcindulak/vagrant-sbd-tutorial-centos7.git
> so hopefully with some help I'll be able to make it work.
>
> Question 1:
> The shared device /dev/sdb1 is a VirtualBox "shareable hard disk"
> (https://www.virtualbox.org/manual/ch05.html#hdimagewrites).
> Will SBD fencing work with that type of storage?

unknown

>
> I start the cluster using vagrant_1.8.1 and virtualbox-4.3 with:
> $ vagrant up  # takes ~15 minutes
>
> The setup brings up the nodes, installs the necessary packages, and prepares
> for the configuration of the pcs cluster.
> You can see which scripts the nodes execute at the bottom of the
> Vagrantfile.
> While sbd itself can be installed on CentOS7 with 'yum -y install sbd',
> the fence_sbd agent has not been packaged yet.

you're not supposed to use it

> Therefore I rebuilt the Fedora 24 package using the latest
> https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz
> plus the update to the fence_sbd from
> https://github.com/ClusterLabs/fence-agents/pull/73
>
> The configuration is inspired by
> https://www.novell.com/support/kb/doc.php?id=7009485 and
> https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html
>
> Question 2:
> After reading http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit I
> expect with just one stonith resource configured

there shouldn't be any stonith resources configured

> a node will be fenced when I stop pacemaker and corosync `pcs cluster stop
> node-1` or just `stonith_admin -F node-1`, but this is not the case.
>
> As can be seen below from uptime, node-1 is not shut down by `pcs cluster
> stop node-1` executed on itself.
> I found some discussions on users at clusterlabs.org about whether a node
> running SBD resource can fence itself,
> but the conclusion was not clear to me.

on RHEL and derivatives it can ONLY fence itself. the disk-based
poison pill isn't supported yet

>
> Question 3:
> Nor is node-1 fenced by `stonith_admin -F node-1` executed on node-2,
> despite /var/log/messages on node-2 (the one currently running MyStonith)
> reporting:
> ...
> notice: Operation 'off' [3309] (call 2 from stonith_admin.3288) for host
> 'node-1' with device 'MyStonith' returned: 0 (OK)
> ...
> What is happening here?

have you tried looking at the sbd logs?
is the watchdog device functioning correctly?
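
A minimal sketch of a first check, assuming sbd is configured to use the
conventional /dev/watchdog node (the helper names are made up for
illustration). It only verifies that the node exists as a character device;
it does not prove the watchdog actually fires.

```shell
#!/bin/sh
# Return 0 if the given path is a character device, as a watchdog node should be.
is_char_device() {
    [ -c "$1" ]
}

# Report whether a watchdog device node is present.
# Defaults to /dev/watchdog (assumed path; adjust to your sbd configuration).
check_watchdog() {
    dev="${1:-/dev/watchdog}"
    if is_char_device "$dev"; then
        echo "watchdog device present: $dev"
    else
        echo "no watchdog device at: $dev"
        return 1
    fi
}

# Possible follow-ups on the node itself (run as root):
#   check_watchdog
#   lsmod | grep -i -e softdog -e wdt   # is a watchdog driver loaded?
#   grep sbd /var/log/messages          # what did sbd log at startup?
```

In a VirtualBox guest there is probably no hardware watchdog, so a missing
device node may simply mean that no driver such as softdog has been loaded.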

>
> Question 4 (for the future):
> Assuming node-1 was fenced, how should SBD be operated afterwards?
> `sbd list` now shows:
> 0       node-3  clear
> 1       node-1  off    node-2
> 2       node-2  clear
> How to clear the status of node-1?
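
A hedged sketch, assuming the `message <node> clear` subcommand described in
the sbd man page is available in your build. The helper name is hypothetical,
and it only prints the commands for slots that are not "clear" so you can
review them before running anything. Not tested on this setup.

```shell
#!/bin/sh
# Read `sbd -d <device> list` output on stdin and print one
# "sbd ... message <node> clear" command per slot whose state is not "clear".
# Nothing is executed; this is a dry run.
pending_clear_commands() {
    device="$1"
    awk -v dev="$device" 'NF >= 3 && $3 != "clear" {
        printf "sbd -d %s message %s clear\n", dev, $2
    }'
}

# Usage (as root, on a node that can see the shared disk):
#   sbd -d /dev/sdb1 list | pending_clear_commands /dev/sdb1
```
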
>
> Question 5 (also for the future):
> While the relation 'stonith-timeout = Timeout (msgwait) + 20%' presented
> at
> https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_storage_protect_fencing.html
> is clearly described, I wonder about the relation of 'stonith-timeout'
> to other timeouts like the 'monitor interval=60s' reported by `pcs stonith
> show MyStonith`.
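
As a worked example of the quoted 20% rule only: with the msgwait of 20 shown
in the `sbd dump` output further down, it gives 24, which matches the
stonith-timeout=24s set in the run. The sketch below derives that number from
the dump; the function names are illustrative, and rounding up to whole
seconds is my assumption. It says nothing about the monitor interval.

```shell
#!/bin/sh
# Extract the msgwait value (seconds) from `sbd -d <device> dump` output on stdin.
msgwait_from_dump() {
    awk -F: '/Timeout \(msgwait\)/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# msgwait + 20%, rounded up to a whole second, using integer arithmetic only.
stonith_timeout_for() {
    echo $(( ($1 * 12 + 9) / 10 ))
}

# Usage (as root):
#   mw=$(sbd -d /dev/sdb1 dump | msgwait_from_dump)
#   pcs property set stonith-timeout="$(stonith_timeout_for "$mw")s"
```
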
>
> Here is how I configure the cluster and test it. The run.sh script is
> attached.
>
> $ sh -x run01.sh 2>&1 | tee run01.txt
>
> with the result:
>
> $ cat run01.txt
>
> Each block below shows the executed ssh command and the result.
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs cluster auth -u hacluster -p password node-1
> node-2 node-3'
> node-1: Authorized
> node-3: Authorized
> node-2: Authorized
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs cluster setup --name mycluster node-1 node-2
> node-3'
> Shutting down pacemaker/corosync services...
> Redirecting to /bin/systemctl stop  pacemaker.service
> Redirecting to /bin/systemctl stop  corosync.service
> Killing any remaining services...
> Removing all cluster configuration files...
> node-1: Succeeded
> node-2: Succeeded
> node-3: Succeeded
> Synchronizing pcsd certificates on nodes node-1, node-2, node-3...
> node-1: Success
> node-3: Success
> node-2: Success
> Restaring pcsd on the nodes in order to reload the certificates...
> node-1: Success
> node-3: Success
> node-2: Success
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs cluster start --all'
> node-3: Starting Cluster...
> node-2: Starting Cluster...
> node-1: Starting Cluster...
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'corosync-cfgtool -s'
> Printing ring status.
> Local node ID 1
> RING ID 0
>     id    = 192.168.10.11
>     status    = ring 0 active with no faults
>
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs status corosync'
> Membership information
> ----------------------
>     Nodeid      Votes Name
>          1          1 node-1 (local)
>          2          1 node-2
>          3          1 node-3
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs status'
> Cluster name: mycluster
> WARNING: no stonith devices and stonith-enabled is not false
> Last updated: Sat Jun 25 15:40:51 2016        Last change: Sat Jun 25
> 15:40:33 2016 by hacluster via crmd on node-2
> Stack: corosync
> Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with
> quorum
> 3 nodes and 0 resources configured
> Online: [ node-1 node-2 node-3 ]
> Full list of resources:
> PCSD Status:
>   node-1: Online
>   node-2: Online
>   node-3: Online
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
>
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'
> 0    node-3    clear
> 1    node-2    clear
> 2    node-1    clear
>
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 dump'
> ==Dumping header on disk /dev/sdb1
> Header version     : 2.1
> UUID               : 79f28167-a207-4f2a-a723-aa1c00bf1dee
> Number of slots    : 255
> Sector size        : 512
> Timeout (watchdog) : 10
> Timeout (allocate) : 2
> Timeout (loop)     : 1
> Timeout (msgwait)  : 20
> ==Header on disk /dev/sdb1 is dumped
>
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs stonith list'
> fence_sbd - Fence agent for sbd
>
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs stonith create MyStonith fence_sbd
> devices=/dev/sdb1 power_timeout=21 action=off'
> ssh node-1 -c sudo su - -c 'pcs property set stonith-enabled=true'
> ssh node-1 -c sudo su - -c 'pcs property set stonith-timeout=24s'
> ssh node-1 -c sudo su - -c 'pcs property'
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: mycluster
>  dc-version: 1.1.13-10.el7_2.2-44eb2dd
>  have-watchdog: true
>  stonith-enabled: true
>  stonith-timeout: 24s
>  stonith-watchdog-timeout: 10s
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs stonith show MyStonith'
>  Resource: MyStonith (class=stonith type=fence_sbd)
>   Attributes: devices=/dev/sdb1 power_timeout=21 action=off
>   Operations: monitor interval=60s (MyStonith-monitor-interval-60s)
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs cluster stop node-1 '
> node-1: Stopping Cluster (pacemaker)...
> node-1: Stopping Cluster (corosync)...
>
>
>
> ############################
> ssh node-2 -c sudo su - -c 'pcs status'
> Cluster name: mycluster
> Last updated: Sat Jun 25 15:42:29 2016        Last change: Sat Jun 25
> 15:41:09 2016 by root via cibadmin on node-1
> Stack: corosync
> Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with
> quorum
> 3 nodes and 1 resource configured
> Online: [ node-2 node-3 ]
> OFFLINE: [ node-1 ]
> Full list of resources:
>  MyStonith    (stonith:fence_sbd):    Started node-2
> PCSD Status:
>   node-1: Online
>   node-2: Online
>   node-3: Online
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
>
>
>
> ############################
> ssh node-2 -c sudo su - -c 'stonith_admin -F node-1 '
>
>
>
> ############################
> ssh node-2 -c sudo su - -c 'grep stonith-ng /var/log/messages'
> Jun 25 15:40:11 localhost stonith-ng[3102]:  notice: Additional logging
> available in /var/log/cluster/corosync.log
> Jun 25 15:40:11 localhost stonith-ng[3102]:  notice: Connecting to cluster
> infrastructure: corosync
> Jun 25 15:40:11 localhost stonith-ng[3102]:  notice: crm_update_peer_proc:
> Node node-2[2] - state is now member (was (null))
> Jun 25 15:40:12 localhost stonith-ng[3102]:  notice: Watching for stonith
> topology changes
> Jun 25 15:40:12 localhost stonith-ng[3102]:  notice: Added 'watchdog' to the
> device list (1 active devices)
> Jun 25 15:40:12 localhost stonith-ng[3102]:  notice: crm_update_peer_proc:
> Node node-3[3] - state is now member (was (null))
> Jun 25 15:40:12 localhost stonith-ng[3102]:  notice: crm_update_peer_proc:
> Node node-1[1] - state is now member (was (null))
> Jun 25 15:40:12 localhost stonith-ng[3102]:  notice: New watchdog timeout
> 10s (was 0s)
> Jun 25 15:41:03 localhost stonith-ng[3102]:  notice: Relying on watchdog
> integration for fencing
> Jun 25 15:41:04 localhost stonith-ng[3102]:  notice: Added 'MyStonith' to
> the device list (2 active devices)
> Jun 25 15:41:54 localhost stonith-ng[3102]:  notice: crm_update_peer_proc:
> Node node-1[1] - state is now lost (was member)
> Jun 25 15:41:54 localhost stonith-ng[3102]:  notice: Removing node-1/1 from
> the membership list
> Jun 25 15:41:54 localhost stonith-ng[3102]:  notice: Purged 1 peers with
> id=1 and/or uname=node-1 from the membership cache
> Jun 25 15:42:33 localhost stonith-ng[3102]:  notice: Client
> stonith_admin.3288.eb400ac9 wants to fence (off) 'node-1' with device
> '(any)'
> Jun 25 15:42:33 localhost stonith-ng[3102]:  notice: Initiating remote
> operation off for node-1: 848cd1e9-55e4-4abc-8d7a-3762eaaf9ab4 (0)
> Jun 25 15:42:33 localhost stonith-ng[3102]:  notice: watchdog can not fence
> (off) node-1: static-list
> Jun 25 15:42:33 localhost stonith-ng[3102]:  notice: MyStonith can fence
> (off) node-1: dynamic-list
> Jun 25 15:42:33 localhost stonith-ng[3102]:  notice: watchdog can not fence
> (off) node-1: static-list
> Jun 25 15:42:54 localhost stonith-ng[3102]:  notice: Operation 'off' [3309]
> (call 2 from stonith_admin.3288) for host 'node-1' with device 'MyStonith'
> returned: 0 (OK)
> Jun 25 15:42:54 localhost stonith-ng[3102]:  notice: Operation off of node-1
> by node-2 for stonith_admin.3288 at node-2.848cd1e9: OK
> Jun 25 15:42:54 localhost stonith-ng[3102]: warning: new_event_notification
> (3102-3288-12): Broken pipe (32)
> Jun 25 15:42:54 localhost stonith-ng[3102]: warning: st_notify_fence
> notification of client stonith_admin.3288.eb400a failed: Broken pipe (-32)
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'
> 0    node-3    clear
> 1    node-2    clear
> 2    node-1    off    node-2
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'uptime'
>  15:43:31 up 21 min,  2 users,  load average: 0.25, 0.18, 0.11
>
>
>
> Cheers,
>
> Marcin