[ClusterLabs] problems with a CentOS7 SBD cluster
Andrew Beekhof
abeekhof at redhat.com
Tue Jun 28 03:04:32 UTC 2016
On Sun, Jun 26, 2016 at 6:05 AM, Marcin Dulak <marcin.dulak at gmail.com> wrote:
> Hi,
>
> I'm trying to get familiar with STONITH Block Devices (SBD) on a 3-node
> CentOS7 cluster built in VirtualBox.
> The complete setup is available at
> https://github.com/marcindulak/vagrant-sbd-tutorial-centos7.git
> so hopefully with some help I'll be able to make it work.
>
> Question 1:
> The shared device /dev/sdb1 is a VirtualBox "shareable hard disk"
> (https://www.virtualbox.org/manual/ch05.html#hdimagewrites).
> Will SBD fencing work with that type of storage?
unknown
>
> I start the cluster using vagrant_1.8.1 and virtualbox-4.3 with:
> $ vagrant up # takes ~15 minutes
>
> The setup brings up the nodes, installs the necessary packages, and prepares
> for the configuration of the pcs cluster.
> You can see which scripts the nodes execute at the bottom of the
> Vagrantfile.
> While sbd can be installed with 'yum -y install sbd' on CentOS7, the
> fence_sbd agent has not been packaged yet.
you're not supposed to use fence_sbd
> Therefore I rebuilt the Fedora 24 package using the latest
> https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz
> plus the update to the fence_sbd from
> https://github.com/ClusterLabs/fence-agents/pull/73
>
> The configuration is inspired by
> https://www.novell.com/support/kb/doc.php?id=7009485 and
> https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html
>
> Question 2:
> After reading http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit I
> expect with just one stonith resource configured
there shouldn't be any stonith resources configured
> a node will be fenced when I stop pacemaker and corosync `pcs cluster stop
> node-1` or just `stonith_admin -F node-1`, but this is not the case.
>
> As can be seen below from uptime, node-1 is not shut down by `pcs cluster
> stop node-1` executed on itself.
> I found some discussions on users at clusterlabs.org about whether a node
> running SBD resource can fence itself,
> but the conclusion was not clear to me.
on RHEL and derivatives it can ONLY fence itself. the disk-based
poison pill isn't supported yet
>
> Question 3:
> Nor is node-1 fenced by `stonith_admin -F node-1` executed on node-2,
> despite /var/log/messages on node-2 (the one currently running MyStonith)
> reporting:
> ...
> notice: Operation 'off' [3309] (call 2 from stonith_admin.3288) for host
> 'node-1' with device 'MyStonith' returned: 0 (OK)
> ...
> What is happening here?
have you tried looking at the sbd logs?
is the watchdog device functioning correctly?
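As a first check on the watchdog side, one can at least confirm that a watchdog device node exists on each node. This is a minimal sketch: the function name `check_watchdog` is made up for illustration, and the default path /dev/watchdog is an assumption about where the device is exposed. A present device node does not prove the watchdog actually fires, only that there is something for sbd to open.

```shell
# Sketch: report whether a watchdog device node exists.
# Assumption: the watchdog is exposed as a character device,
# conventionally at /dev/watchdog (pass another path to override).
check_watchdog() {
    dev="${1:-/dev/watchdog}"
    if [ -c "$dev" ]; then
        # A character device exists at this path
        echo "present"
    else
        # Nothing there, or not a character device
        echo "missing"
    fi
}

# Example call (run on each node):
#   check_watchdog /dev/watchdog
```

For the sbd logs, the same kind of grep used for stonith-ng further down could be pointed at sbd instead, assuming sbd logs to the same file.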
>
> Question 4 (for the future):
> Assuming node-1 was fenced, how is SBD operated afterwards?
> I see that sbd now lists:
> 0 node-3 clear
> 1 node-1 off node-2
> 2 node-2 clear
> How to clear the status of node-1?
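One possible way to handle this, sketched under assumptions: sbd has a `message` subcommand that accepts `clear` as a message type, so a slot left at `off` can be reset by sending `clear` to that node. The helper below (the name `pending_clears` is hypothetical) reads `sbd -d <device> list` output on stdin and only prints the commands it would suggest; it does not run them. Check the sbd man page on your distribution before relying on it.

```shell
# Sketch: print an "sbd ... message <node> clear" command for every
# slot whose state is not "clear".
# Input: output of `sbd -d <device> list` on stdin
#        (columns: slot, node name, state, optional sender).
# Output: suggested commands, one per line; nothing is executed.
pending_clears() {
    device="$1"
    awk -v dev="$device" '
        # $2 is the node name, $3 the slot state
        $3 != "" && $3 != "clear" {
            printf "sbd -d %s message %s clear\n", dev, $2
        }'
}

# Example call:
#   sbd -d /dev/sdb1 list | pending_clears /dev/sdb1
```

With the listing shown above, only the node-1 slot is in state `off`, so a single command for node-1 would be printed.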
>
> Question 5 (also for the future):
> While the relation 'stonith-timeout = Timeout (msgwait) + 20%' presented
> at
> https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_storage_protect_fencing.html
> is clearly described, I wonder about the relation of 'stonith-timeout'
> to other timeouts like the 'monitor interval=60s' reported by `pcs stonith
> show MyStonith`.
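For the first relation, the arithmetic can be written down as a small sketch. With the Timeout (msgwait) of 20 reported by the sbd dump in this run, msgwait + 20% gives 20 * 1.2 = 24, which matches the stonith-timeout=24s set here. The function name `stonith_timeout` is made up for illustration, and rounding up to whole seconds is an assumption, not something the SUSE document states. This does not answer how the monitor interval relates to it.

```shell
# Sketch: stonith-timeout = msgwait + 20%.
# Assumption: the result is rounded up to whole seconds.
stonith_timeout() {
    msgwait="$1"
    # Integer ceiling of msgwait * 1.2
    echo $(( (msgwait * 12 + 9) / 10 ))
}

# Example call:
#   pcs property set stonith-timeout="$(stonith_timeout 20)s"
```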
>
> Here is how I configure the cluster and test it. The run.sh script is
> attached.
>
> $ sh -x run01.sh 2>&1 | tee run01.txt
>
> with the result:
>
> $ cat run01.txt
>
> Each block below shows the executed ssh command and the result.
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs cluster auth -u hacluster -p password node-1
> node-2 node-3'
> node-1: Authorized
> node-3: Authorized
> node-2: Authorized
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs cluster setup --name mycluster node-1 node-2
> node-3'
> Shutting down pacemaker/corosync services...
> Redirecting to /bin/systemctl stop pacemaker.service
> Redirecting to /bin/systemctl stop corosync.service
> Killing any remaining services...
> Removing all cluster configuration files...
> node-1: Succeeded
> node-2: Succeeded
> node-3: Succeeded
> Synchronizing pcsd certificates on nodes node-1, node-2, node-3...
> node-1: Success
> node-3: Success
> node-2: Success
> Restaring pcsd on the nodes in order to reload the certificates...
> node-1: Success
> node-3: Success
> node-2: Success
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs cluster start --all'
> node-3: Starting Cluster...
> node-2: Starting Cluster...
> node-1: Starting Cluster...
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'corosync-cfgtool -s'
> Printing ring status.
> Local node ID 1
> RING ID 0
> id = 192.168.10.11
> status = ring 0 active with no faults
>
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs status corosync'
> Membership information
> ----------------------
> Nodeid Votes Name
> 1 1 node-1 (local)
> 2 1 node-2
> 3 1 node-3
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs status'
> Cluster name: mycluster
> WARNING: no stonith devices and stonith-enabled is not false
> Last updated: Sat Jun 25 15:40:51 2016 Last change: Sat Jun 25
> 15:40:33 2016 by hacluster via crmd on node-2
> Stack: corosync
> Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with
> quorum
> 3 nodes and 0 resources configured
> Online: [ node-1 node-2 node-3 ]
> Full list of resources:
> PCSD Status:
> node-1: Online
> node-2: Online
> node-3: Online
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
>
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'
> 0 node-3 clear
> 1 node-2 clear
> 2 node-1 clear
>
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 dump'
> ==Dumping header on disk /dev/sdb1
> Header version : 2.1
> UUID : 79f28167-a207-4f2a-a723-aa1c00bf1dee
> Number of slots : 255
> Sector size : 512
> Timeout (watchdog) : 10
> Timeout (allocate) : 2
> Timeout (loop) : 1
> Timeout (msgwait) : 20
> ==Header on disk /dev/sdb1 is dumped
>
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs stonith list'
> fence_sbd - Fence agent for sbd
>
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs stonith create MyStonith fence_sbd
> devices=/dev/sdb1 power_timeout=21 action=off'
> ssh node-1 -c sudo su - -c 'pcs property set stonith-enabled=true'
> ssh node-1 -c sudo su - -c 'pcs property set stonith-timeout=24s'
> ssh node-1 -c sudo su - -c 'pcs property'
> Cluster Properties:
> cluster-infrastructure: corosync
> cluster-name: mycluster
> dc-version: 1.1.13-10.el7_2.2-44eb2dd
> have-watchdog: true
> stonith-enabled: true
> stonith-timeout: 24s
> stonith-watchdog-timeout: 10s
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs stonith show MyStonith'
> Resource: MyStonith (class=stonith type=fence_sbd)
> Attributes: devices=/dev/sdb1 power_timeout=21 action=off
> Operations: monitor interval=60s (MyStonith-monitor-interval-60s)
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'pcs cluster stop node-1 '
> node-1: Stopping Cluster (pacemaker)...
> node-1: Stopping Cluster (corosync)...
>
>
>
> ############################
> ssh node-2 -c sudo su - -c 'pcs status'
> Cluster name: mycluster
> Last updated: Sat Jun 25 15:42:29 2016 Last change: Sat Jun 25
> 15:41:09 2016 by root via cibadmin on node-1
> Stack: corosync
> Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with
> quorum
> 3 nodes and 1 resource configured
> Online: [ node-2 node-3 ]
> OFFLINE: [ node-1 ]
> Full list of resources:
> MyStonith (stonith:fence_sbd): Started node-2
> PCSD Status:
> node-1: Online
> node-2: Online
> node-3: Online
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
>
>
>
> ############################
> ssh node-2 -c sudo su - -c 'stonith_admin -F node-1 '
>
>
>
> ############################
> ssh node-2 -c sudo su - -c 'grep stonith-ng /var/log/messages'
> Jun 25 15:40:11 localhost stonith-ng[3102]: notice: Additional logging
> available in /var/log/cluster/corosync.log
> Jun 25 15:40:11 localhost stonith-ng[3102]: notice: Connecting to cluster
> infrastructure: corosync
> Jun 25 15:40:11 localhost stonith-ng[3102]: notice: crm_update_peer_proc:
> Node node-2[2] - state is now member (was (null))
> Jun 25 15:40:12 localhost stonith-ng[3102]: notice: Watching for stonith
> topology changes
> Jun 25 15:40:12 localhost stonith-ng[3102]: notice: Added 'watchdog' to the
> device list (1 active devices)
> Jun 25 15:40:12 localhost stonith-ng[3102]: notice: crm_update_peer_proc:
> Node node-3[3] - state is now member (was (null))
> Jun 25 15:40:12 localhost stonith-ng[3102]: notice: crm_update_peer_proc:
> Node node-1[1] - state is now member (was (null))
> Jun 25 15:40:12 localhost stonith-ng[3102]: notice: New watchdog timeout
> 10s (was 0s)
> Jun 25 15:41:03 localhost stonith-ng[3102]: notice: Relying on watchdog
> integration for fencing
> Jun 25 15:41:04 localhost stonith-ng[3102]: notice: Added 'MyStonith' to
> the device list (2 active devices)
> Jun 25 15:41:54 localhost stonith-ng[3102]: notice: crm_update_peer_proc:
> Node node-1[1] - state is now lost (was member)
> Jun 25 15:41:54 localhost stonith-ng[3102]: notice: Removing node-1/1 from
> the membership list
> Jun 25 15:41:54 localhost stonith-ng[3102]: notice: Purged 1 peers with
> id=1 and/or uname=node-1 from the membership cache
> Jun 25 15:42:33 localhost stonith-ng[3102]: notice: Client
> stonith_admin.3288.eb400ac9 wants to fence (off) 'node-1' with device
> '(any)'
> Jun 25 15:42:33 localhost stonith-ng[3102]: notice: Initiating remote
> operation off for node-1: 848cd1e9-55e4-4abc-8d7a-3762eaaf9ab4 (0)
> Jun 25 15:42:33 localhost stonith-ng[3102]: notice: watchdog can not fence
> (off) node-1: static-list
> Jun 25 15:42:33 localhost stonith-ng[3102]: notice: MyStonith can fence
> (off) node-1: dynamic-list
> Jun 25 15:42:33 localhost stonith-ng[3102]: notice: watchdog can not fence
> (off) node-1: static-list
> Jun 25 15:42:54 localhost stonith-ng[3102]: notice: Operation 'off' [3309]
> (call 2 from stonith_admin.3288) for host 'node-1' with device 'MyStonith'
> returned: 0 (OK)
> Jun 25 15:42:54 localhost stonith-ng[3102]: notice: Operation off of node-1
> by node-2 for stonith_admin.3288 at node-2.848cd1e9: OK
> Jun 25 15:42:54 localhost stonith-ng[3102]: warning: new_event_notification
> (3102-3288-12): Broken pipe (32)
> Jun 25 15:42:54 localhost stonith-ng[3102]: warning: st_notify_fence
> notification of client stonith_admin.3288.eb400a failed: Broken pipe (-32)
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'
> 0 node-3 clear
> 1 node-2 clear
> 2 node-1 off node-2
>
>
>
> ############################
> ssh node-1 -c sudo su - -c 'uptime'
> 15:43:31 up 21 min, 2 users, load average: 0.25, 0.18, 0.11
>
>
>
> Cheers,
>
> Marcin