[ClusterLabs] problems with a CentOS7 SBD cluster
Klaus Wenninger
kwenning at redhat.com
Tue Jun 28 09:51:43 UTC 2016
On 06/28/2016 11:24 AM, Marcin Dulak wrote:
>
>
> On Tue, Jun 28, 2016 at 5:04 AM, Andrew Beekhof <abeekhof at redhat.com> wrote:
>
> On Sun, Jun 26, 2016 at 6:05 AM, Marcin Dulak <marcin.dulak at gmail.com> wrote:
> > Hi,
> >
> > I'm trying to get familiar with STONITH Block Devices (SBD) on a 3-node
> > CentOS7 cluster built in VirtualBox.
> > The complete setup is available at
> > https://github.com/marcindulak/vagrant-sbd-tutorial-centos7.git
> > so hopefully with some help I'll be able to make it work.
> >
> > Question 1:
> > The shared device /dev/sdb1 is VirtualBox's "shareable hard disk"
> > https://www.virtualbox.org/manual/ch05.html#hdimagewrites
> > will SBD fencing work with that type of storage?
>
> unknown
>
> >
> > I start the cluster using vagrant_1.8.1 and virtualbox-4.3 with:
> > $ vagrant up # takes ~15 minutes
> >
> > The setup brings up the nodes, installs the necessary packages,
> and prepares
> > for the configuration of the pcs cluster.
> > You can see which scripts the nodes execute at the bottom of the
> > Vagrantfile.
> > While there is 'yum -y install sbd' on CentOS7, the fence_sbd agent has not
> > been packaged yet.
>
> you're not supposed to use it
>
> > Therefore I rebuilt the Fedora 24 package using the latest
> > https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz
> > plus the update to fence_sbd from
> > https://github.com/ClusterLabs/fence-agents/pull/73
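> >
> > Roughly, a source build along those lines would look like this (a sketch,
> > not the exact rpm rebuild I used):
> >
> > wget https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz
> > tar xf v4.0.22.tar.gz && cd fence-agents-4.0.22
> > curl -L https://github.com/ClusterLabs/fence-agents/pull/73.patch | patch -p1
> > ./autogen.sh && ./configure && make && make install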
> >
> > The configuration is inspired by
> > https://www.novell.com/support/kb/doc.php?id=7009485 and
> >
> https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html
> >
> > Question 2:
> > After reading
> http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit I
> > expect with just one stonith resource configured
>
> there shouldn't be any stonith resources configured
>
>
> It's a test setup.
> Found https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html
>
> crm configure
> property stonith-enabled="true"
> property stonith-timeout="40s"
> primitive stonith_sbd stonith:external/sbd op start interval="0"
> timeout="15" start-delay="10"
> commit
> quit
For what is supported (self-fencing via the watchdog), the stonith resource is
simply not needed, because sbd and pacemaker interact via the CIB.
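A minimal watchdog-only sketch (untested on your VirtualBox images; the SBD_*
variable names below are the ones shipped in /etc/sysconfig/sbd, so double-check
them against your package):

# relevant lines in /etc/sysconfig/sbd on every node
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5

[root at node-1 ~]# systemctl enable sbd
# sbd is tied to the cluster stack, so restart it to pick sbd up
[root at node-1 ~]# pcs cluster stop node-1 && pcs cluster start node-1

and cluster-wide, so that pacemaker assumes a lost node has fenced itself:

[root at node-1 ~]# pcs property set stonith-watchdog-timeout=10s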
>
>
> and trying to configure CentOS7 similarly.
>
>
>
> > a node will be fenced when I stop pacemaker and corosync with `pcs cluster stop
> > node-1` or just `stonith_admin -F node-1`, but this is not the case.
> >
> > As can be seen below from uptime, node-1 is not shut down by `pcs cluster
> > stop node-1` executed on itself.
> > I found some discussions on users at clusterlabs.org about whether a node
> > running an SBD resource can fence itself,
> > but the conclusion was not clear to me.
>
> on RHEL and derivatives it can ONLY fence itself. The disk-based
> poison pill isn't supported yet.
>
>
> once it's supported on RHEL I'll be ready :)
"Not supported" in this case doesn't (just) mean that you will receive very
limited help, if any; it means that sbd is built with "--disable-shared-disk".
So unless you rebuild the package accordingly (making it the other type of not
supported then ;-) ), testing with a block device won't make much sense, I
guess.
I'm already a little surprised that you get what you get ;-)
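Roughly - untested, and then unsupported in the other sense ;-) - such a
rebuild could look like:

[root at node-1 ~]# yum -y install rpm-build yum-utils
[root at node-1 ~]# yumdownloader --source sbd
[root at node-1 ~]# rpm -ivh sbd-*.src.rpm
# drop --disable-shared-disk from the configure invocation in the spec
[root at node-1 ~]# vi ~/rpmbuild/SPECS/sbd.spec
[root at node-1 ~]# yum-builddep -y ~/rpmbuild/SPECS/sbd.spec
[root at node-1 ~]# rpmbuild -ba ~/rpmbuild/SPECS/sbd.spec
[root at node-1 ~]# yum -y install ~/rpmbuild/RPMS/x86_64/sbd-*.rpm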
>
>
> >
> > Question 3:
> > Neither is node-1 fenced by `stonith_admin -F node-1` executed on node-2,
> > despite the fact that /var/log/messages on node-2 (the one currently running
> > MyStonith) reports:
> > ...
> > notice: Operation 'off' [3309] (call 2 from stonith_admin.3288)
> for host
> > 'node-1' with device 'MyStonith' returned: 0 (OK)
> > ...
> > What is happening here?
>
> have you tried looking at the sbd logs?
> is the watchdog device functioning correctly?
>
>
> it turned out (suggested here
> http://clusterlabs.org/pipermail/users/2016-June/003355.html) that the
> reason for node-1 not being fenced by `stonith_admin -F node-1`
> executed on node-2
> was the previously executed `pcs cluster stop node-1`. In my setup SBD
> seems integrated with corosync/pacemaker and the latter command
> stopped the sbd service on node-1.
> Killing corosync on node-1 instead of `pcs cluster stop node-1` fences
> node-1 as expected:
>
> [root at node-1 ~]# killall -15 corosync
> Broadcast message from systemd-journald at node-1 (Sat 2016-06-25
> 21:55:07 EDT):
> sbd[4761]: /dev/sdb1: emerg: do_exit: Rebooting system: off
>
> I'm left with further questions: how do I set up fence_sbd so that the fenced
> node shuts down instead of rebooting?
> Both the action=off and the mode=onoff action=off options passed to fence_sbd
> when creating the MyStonith resource result in a reboot.
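>
> (My guess, not verified: the reboot is initiated by the cluster itself once
> node-1 drops out of the membership, and cluster-initiated fencing follows the
> cluster-wide stonith-action property (default "reboot") rather than the
> agent's action= parameter. Something like
>
> [root at node-2 ~]# pcs property set stonith-action=off
>
> or a device-level pcmk_reboot_action=off on MyStonith might give the shutdown
> I am after, but I have tested neither.)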
>
> [root at node-2 ~]# pcs stonith show MyStonith
> Resource: MyStonith (class=stonith type=fence_sbd)
> Attributes: devices=/dev/sdb1 power_timeout=21 action=off
> Operations: monitor interval=60s (MyStonith-monitor-interval-60s)
>
> [root at node-2 ~]# pcs status
> Cluster name: mycluster
> Last updated: Tue Jun 28 04:55:43 2016 Last change: Tue Jun 28
> 04:48:03 2016 by root via cibadmin on node-1
> Stack: corosync
> Current DC: node-3 (version 1.1.13-10.el7_2.2-44eb2dd) - partition
> with quorum
> 3 nodes and 1 resource configured
>
> Online: [ node-1 node-2 node-3 ]
>
> Full list of resources:
>
> MyStonith (stonith:fence_sbd): Started node-2
>
> PCSD Status:
> node-1: Online
> node-2: Online
> node-3: Online
>
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
>
> Starting from the above cluster state:
> [root at node-2 ~]# stonith_admin -F node-1
> also results in a reboot of node-1 instead of a shutdown.
>
> /var/log/messages after the last command show "reboot" on node-2
> ...
> Jun 28 04:49:39 localhost stonith-ng[3081]: notice: Client
> stonith_admin.3179.fbc038ee wants to fence (off) 'node-1' with device
> '(any)'
> Jun 28 04:49:39 localhost stonith-ng[3081]: notice: Initiating remote
> operation off for node-1: 8aea4f12-538d-41ab-bf20-0c8b0f72e2a3 (0)
> Jun 28 04:49:39 localhost stonith-ng[3081]: notice: watchdog can not
> fence (off) node-1: static-list
> Jun 28 04:49:40 localhost stonith-ng[3081]: notice: MyStonith can
> fence (off) node-1: dynamic-list
> Jun 28 04:49:40 localhost stonith-ng[3081]: notice: watchdog can not
> fence (off) node-1: static-list
> Jun 28 04:49:44 localhost stonith-ng[3081]: notice:
> crm_update_peer_proc: Node node-1[1] - state is now lost (was member)
> Jun 28 04:49:44 localhost stonith-ng[3081]: notice: Removing node-1/1
> from the membership list
> Jun 28 04:49:44 localhost stonith-ng[3081]: notice: Purged 1 peers
> with id=1 and/or uname=node-1 from the membership cache
> Jun 28 04:49:45 localhost stonith-ng[3081]: notice: MyStonith can
> fence (reboot) node-1: dynamic-list
> Jun 28 04:49:45 localhost stonith-ng[3081]: notice: watchdog can not
> fence (reboot) node-1: static-list
> Jun 28 04:49:46 localhost stonith-ng[3081]: notice: Operation reboot
> of node-1 by node-3 for crmd.3063 at node-3.36859c4e: OK
> Jun 28 04:50:00 localhost stonith-ng[3081]: notice: Operation 'off'
> [3200] (call 2 from stonith_admin.3179) for host 'node-1' with device
> 'MyStonith' returned: 0 (OK)
> Jun 28 04:50:00 localhost stonith-ng[3081]: notice: Operation off of
> node-1 by node-2 for stonith_admin.3179 at node-2.8aea4f12: OK
> ...
>
>
> Another question (I think it is also valid for a potential
> SUSE setup): what is the proper way of operating a cluster with SBD
> after node-1 has been fenced?
>
> [root at node-2 ~]# sbd -d /dev/sdb1 list
> 0 node-3 clear
> 1 node-2 clear
> 2 node-1 off node-2
>
> I found that executing sbd watch on node-1 clears the SBD status:
> [root at node-1 ~]# sbd -d /dev/sdb1 watch
> [root at node-1 ~]# sbd -d /dev/sdb1 list
> 0 node-3 clear
> 1 node-2 clear
> 2 node-1 clear
> Then I make sure that sbd is not running on node-1 (I can do that because
> node-1 is currently not part of the cluster):
> [root at node-1 ~]# killall -15 sbd
> I have to kill sbd because it is integrated with corosync, and corosync
> fails to start on node-1 with sbd already running.
>
> I can now join node-1 to the cluster from node-2:
> [root at node-2 ~]# pcs cluster start node-1
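>
> (Presumably the slot could also be cleared from another node with
> [root at node-2 ~]# sbd -d /dev/sdb1 message node-1 clear
> instead of restarting `sbd watch` on node-1 itself, but I have not tried that.)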
>
>
> Marcin
>
>
> >
> > Question 4 (for the future):
> > Assuming the node-1 was fenced, what is the way of operating SBD?
> > I see the sbd lists now:
> > 0 node-3 clear
> > 1 node-1 off node-2
> > 2 node-2 clear
> > How to clear the status of node-1?
> >
> > Question 5 (also for the future):
> > While the relation 'stonith-timeout = Timeout (msgwait) + 20%'
> presented
> > at
> >
> https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_storage_protect_fencing.html
> > is clearly described, I wonder about the relation of
> 'stonith-timeout'
> > to other timeouts like the 'monitor interval=60s' reported by
> `pcs stonith
> > show MyStonith`.
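> >
> > (If I read that formula right, the Timeout (msgwait) of 20s dumped below
> > gives stonith-timeout >= 20s * 1.2 = 24s, which is what I set; the monitor
> > interval=60s should be unrelated - as far as I understand it only controls
> > how often pacemaker re-runs the stonith agent's monitor operation.)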
> >
> > Here is how I configure the cluster and test it. The run.sh
> script is
> > attached.
> >
> > $ sh -x run01.sh 2>&1 | tee run01.txt
> >
> > with the result:
> >
> > $ cat run01.txt
> >
> > Each block below shows the executed ssh command and the result.
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs cluster auth -u hacluster -p
> password node-1
> > node-2 node-3'
> > node-1: Authorized
> > node-3: Authorized
> > node-2: Authorized
> >
> >
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs cluster setup --name mycluster
> node-1 node-2
> > node-3'
> > Shutting down pacemaker/corosync services...
> > Redirecting to /bin/systemctl stop pacemaker.service
> > Redirecting to /bin/systemctl stop corosync.service
> > Killing any remaining services...
> > Removing all cluster configuration files...
> > node-1: Succeeded
> > node-2: Succeeded
> > node-3: Succeeded
> > Synchronizing pcsd certificates on nodes node-1, node-2, node-3...
> > node-1: Success
> > node-3: Success
> > node-2: Success
> > Restaring pcsd on the nodes in order to reload the certificates...
> > node-1: Success
> > node-3: Success
> > node-2: Success
> >
> >
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs cluster start --all'
> > node-3: Starting Cluster...
> > node-2: Starting Cluster...
> > node-1: Starting Cluster...
> >
> >
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'corosync-cfgtool -s'
> > Printing ring status.
> > Local node ID 1
> > RING ID 0
> > id = 192.168.10.11
> > status = ring 0 active with no faults
> >
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs status corosync'
> > Membership information
> > ----------------------
> > Nodeid Votes Name
> > 1 1 node-1 (local)
> > 2 1 node-2
> > 3 1 node-3
> >
> >
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs status'
> > Cluster name: mycluster
> > WARNING: no stonith devices and stonith-enabled is not false
> > Last updated: Sat Jun 25 15:40:51 2016 Last change: Sat
> Jun 25
> > 15:40:33 2016 by hacluster via crmd on node-2
> > Stack: corosync
> > Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) -
> partition with
> > quorum
> > 3 nodes and 0 resources configured
> > Online: [ node-1 node-2 node-3 ]
> > Full list of resources:
> > PCSD Status:
> > node-1: Online
> > node-2: Online
> > node-3: Online
> > Daemon Status:
> > corosync: active/disabled
> > pacemaker: active/disabled
> > pcsd: active/enabled
> >
> >
> >
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'
> > 0 node-3 clear
> > 1 node-2 clear
> > 2 node-1 clear
> >
> >
> >
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 dump'
> > ==Dumping header on disk /dev/sdb1
> > Header version : 2.1
> > UUID : 79f28167-a207-4f2a-a723-aa1c00bf1dee
> > Number of slots : 255
> > Sector size : 512
> > Timeout (watchdog) : 10
> > Timeout (allocate) : 2
> > Timeout (loop) : 1
> > Timeout (msgwait) : 20
> > ==Header on disk /dev/sdb1 is dumped
> >
> >
> >
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs stonith list'
> > fence_sbd - Fence agent for sbd
> >
> >
> >
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs stonith create MyStonith fence_sbd
> > devices=/dev/sdb1 power_timeout=21 action=off'
> > ssh node-1 -c sudo su - -c 'pcs property set stonith-enabled=true'
> > ssh node-1 -c sudo su - -c 'pcs property set stonith-timeout=24s'
> > ssh node-1 -c sudo su - -c 'pcs property'
> > Cluster Properties:
> > cluster-infrastructure: corosync
> > cluster-name: mycluster
> > dc-version: 1.1.13-10.el7_2.2-44eb2dd
> > have-watchdog: true
> > stonith-enabled: true
> > stonith-timeout: 24s
> > stonith-watchdog-timeout: 10s
> >
> >
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs stonith show MyStonith'
> > Resource: MyStonith (class=stonith type=fence_sbd)
> > Attributes: devices=/dev/sdb1 power_timeout=21 action=off
> > Operations: monitor interval=60s (MyStonith-monitor-interval-60s)
> >
> >
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs cluster stop node-1 '
> > node-1: Stopping Cluster (pacemaker)...
> > node-1: Stopping Cluster (corosync)...
> >
> >
> >
> > ############################
> > ssh node-2 -c sudo su - -c 'pcs status'
> > Cluster name: mycluster
> > Last updated: Sat Jun 25 15:42:29 2016 Last change: Sat
> Jun 25
> > 15:41:09 2016 by root via cibadmin on node-1
> > Stack: corosync
> > Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) -
> partition with
> > quorum
> > 3 nodes and 1 resource configured
> > Online: [ node-2 node-3 ]
> > OFFLINE: [ node-1 ]
> > Full list of resources:
> > MyStonith (stonith:fence_sbd): Started node-2
> > PCSD Status:
> > node-1: Online
> > node-2: Online
> > node-3: Online
> > Daemon Status:
> > corosync: active/disabled
> > pacemaker: active/disabled
> > pcsd: active/enabled
> >
> >
> >
> > ############################
> > ssh node-2 -c sudo su - -c 'stonith_admin -F node-1 '
> >
> >
> >
> > ############################
> > ssh node-2 -c sudo su - -c 'grep stonith-ng /var/log/messages'
> > Jun 25 15:40:11 localhost stonith-ng[3102]: notice: Additional
> logging
> > available in /var/log/cluster/corosync.log
> > Jun 25 15:40:11 localhost stonith-ng[3102]: notice: Connecting
> to cluster
> > infrastructure: corosync
> > Jun 25 15:40:11 localhost stonith-ng[3102]: notice:
> crm_update_peer_proc:
> > Node node-2[2] - state is now member (was (null))
> > Jun 25 15:40:12 localhost stonith-ng[3102]: notice: Watching
> for stonith
> > topology changes
> > Jun 25 15:40:12 localhost stonith-ng[3102]: notice: Added
> 'watchdog' to the
> > device list (1 active devices)
> > Jun 25 15:40:12 localhost stonith-ng[3102]: notice:
> crm_update_peer_proc:
> > Node node-3[3] - state is now member (was (null))
> > Jun 25 15:40:12 localhost stonith-ng[3102]: notice:
> crm_update_peer_proc:
> > Node node-1[1] - state is now member (was (null))
> > Jun 25 15:40:12 localhost stonith-ng[3102]: notice: New
> watchdog timeout
> > 10s (was 0s)
> > Jun 25 15:41:03 localhost stonith-ng[3102]: notice: Relying on
> watchdog
> > integration for fencing
> > Jun 25 15:41:04 localhost stonith-ng[3102]: notice: Added
> 'MyStonith' to
> > the device list (2 active devices)
> > Jun 25 15:41:54 localhost stonith-ng[3102]: notice:
> crm_update_peer_proc:
> > Node node-1[1] - state is now lost (was member)
> > Jun 25 15:41:54 localhost stonith-ng[3102]: notice: Removing
> node-1/1 from
> > the membership list
> > Jun 25 15:41:54 localhost stonith-ng[3102]: notice: Purged 1
> peers with
> > id=1 and/or uname=node-1 from the membership cache
> > Jun 25 15:42:33 localhost stonith-ng[3102]: notice: Client
> > stonith_admin.3288.eb400ac9 wants to fence (off) 'node-1' with
> device
> > '(any)'
> > Jun 25 15:42:33 localhost stonith-ng[3102]: notice: Initiating
> remote
> > operation off for node-1: 848cd1e9-55e4-4abc-8d7a-3762eaaf9ab4 (0)
> > Jun 25 15:42:33 localhost stonith-ng[3102]: notice: watchdog
> can not fence
> > (off) node-1: static-list
> > Jun 25 15:42:33 localhost stonith-ng[3102]: notice: MyStonith
> can fence
> > (off) node-1: dynamic-list
> > Jun 25 15:42:33 localhost stonith-ng[3102]: notice: watchdog
> can not fence
> > (off) node-1: static-list
> > Jun 25 15:42:54 localhost stonith-ng[3102]: notice: Operation
> 'off' [3309]
> > (call 2 from stonith_admin.3288) for host 'node-1' with device
> 'MyStonith'
> > returned: 0 (OK)
> > Jun 25 15:42:54 localhost stonith-ng[3102]: notice: Operation
> off of node-1
> > by node-2 for stonith_admin.3288 at node-2.848cd1e9: OK
> > Jun 25 15:42:54 localhost stonith-ng[3102]: warning:
> new_event_notification
> > (3102-3288-12): Broken pipe (32)
> > Jun 25 15:42:54 localhost stonith-ng[3102]: warning: st_notify_fence
> > notification of client stonith_admin.3288.eb400a failed: Broken
> pipe (-32)
> >
> >
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'
> > 0 node-3 clear
> > 1 node-2 clear
> > 2 node-1 off node-2
> >
> >
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'uptime'
> > 15:43:31 up 21 min, 2 users, load average: 0.25, 0.18, 0.11
> >
> >
> >
> > Cheers,
> >
> > Marcin
> >
> >
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org