[ClusterLabs] problems with a CentOS7 SBD cluster

Marcin Dulak marcin.dulak at gmail.com
Sun Jun 26 12:08:10 UTC 2016


Hi,

>> As can be seen below from uptime, node-1 is not shut down by `pcs
>> cluster stop node-1` executed on itself.
>> I found some discussions on users at clusterlabs.org
>> <http://clusterlabs.org/mailman/listinfo/users> about whether a node
>> running the SBD resource can fence itself,
>> but the conclusion was not clear to me.
>
> I am not familiar with pcs, but stopping the pacemaker services manually
> makes the node leave the cluster in a controlled manner, and does not
> result in fencing, at least in my experience.

I confirm that killing corosync on node-1 results in node-1 being fenced,
but with a reboot instead of the shutdown I want:
[root at node-1 ~]# killall -15 corosync
Broadcast message from systemd-journald at node-1 (Sat 2016-06-25 21:55:07 EDT):

sbd[4761]:  /dev/sdb1:    emerg: do_exit: Rebooting system: off

So the next one is question 6: how do I set up fence_sbd so that the fenced
node shuts down instead of rebooting?
Both action=off and mode=onoff action=off passed to fence_sbd when creating
the MyStonith resource result in a reboot.

[root at node-2 ~]# pcs stonith show MyStonith
 Resource: MyStonith (class=stonith type=fence_sbd)
  Attributes: devices=/dev/sdb1 power_timeout=21 action=off
  Operations: monitor interval=60s (MyStonith-monitor-interval-60s)
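
For completeness, the resource was created roughly like this (reconstructed
from the attributes shown above, so the exact syntax may differ slightly from
what I originally typed):

[root at node-2 ~]# pcs stonith create MyStonith fence_sbd devices=/dev/sdb1 \
    power_timeout=21 action=off op monitor interval=60s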

Another question (question 4 from my first post): the cluster is now in
the state listed below.

[root at node-2 ~]# pcs status
Cluster name: mycluster
Last updated: Sat Jun 25 22:06:51 2016        Last change: Sat Jun 25 15:41:09 2016 by root via cibadmin on node-1
Stack: corosync
Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
3 nodes and 1 resource configured

Online: [ node-2 node-3 ]
OFFLINE: [ node-1 ]

Full list of resources:

 MyStonith    (stonith:fence_sbd):    Started node-2

PCSD Status:
  node-1: Online
  node-2: Online
  node-3: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
[root at node-2 ~]# sbd -d /dev/sdb1 list
0    node-3    clear
1    node-2    clear
2    node-1    off    node-2


What is the proper way of operating a cluster with SBD?
I found that running `sbd watch` on node-1 clears node-1's slot on the SBD device:
[root at node-1 ~]# sbd -d /dev/sdb1 watch
[root at node-1 ~]# sbd -d /dev/sdb1 list
0    node-3    clear
1    node-2    clear
2    node-1    clear
After making sure that sbd is not running on node-1 (I can do that because
node-1 is currently not part of the cluster):

[root at node-1 ~]# killall -15 sbd

I can then rejoin node-1 to the cluster from node-2:
[root at node-2 ~]# pcs cluster start node-1
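
If I read the sbd man page correctly, the slot can also be cleared from any
node that has access to the device, without starting a watcher on node-1,
for example:

    sbd -d /dev/sdb1 message node-1 clear

but so far I have only tested the `sbd watch` route shown above.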


>> Question 3:
>> node-1 is not fenced by `stonith_admin -F node-1` executed on node-2,
>> despite the fact that
>> /var/log/messages on node-2 (the one currently running MyStonith) reports:
>> ...
>> notice: Operation 'off' [3309] (call 2 from stonith_admin.3288) for host
>> 'node-1' with device 'MyStonith' returned: 0 (OK)
>> ...
>> What is happening here?
>
> Do you have the sbd daemon running? SBD is based on self-fencing - the only
> thing the fence agent does is to place a request for another node to kill
> itself. It is expected that sbd running on the other node will respond to
> this request by committing suicide.
>
It looks to me that, as expected, sbd is integrated with corosync, and by
doing `pcs cluster stop node-1` I also stopped sbd on node-1, so node-1 did
not respond to the fence request from node-2.
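
(For anyone reproducing this: a quick way to check whether sbd is actually
watching the device on a node is something like

    ps axo pid,args | grep '[s]bd'

which, when sbd is active, should list the sbd inquisitor plus a watcher
process per device - the exact process titles may vary with the sbd version.)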


Now, back to question 6: with sbd running on node-1 and node-1 being part
of the cluster,

[root at node-2 ~]# stonith_admin -F node-1

results in a reboot of node-1 instead of a shutdown.

/var/log/messages on node-2 after the last command shows "reboot":

...
Jun 25 22:36:46 localhost stonith-ng[3102]:  notice: Client crmd.3106.b61d09b8 wants to fence (reboot) 'node-1' with device '(any)'
Jun 25 22:36:46 localhost stonith-ng[3102]:  notice: Initiating remote operation reboot for node-1: f29ba740-4929-4755-a3f5-3aca9ff3c3ff (0)
Jun 25 22:36:46 localhost stonith-ng[3102]:  notice: MyStonith can fence (reboot) node-1: dynamic-list
Jun 25 22:36:46 localhost stonith-ng[3102]:  notice: watchdog can not fence (reboot) node-1: static-list
Jun 25 22:36:46 localhost stonith-ng[3102]:  notice: MyStonith can fence (reboot) node-1: dynamic-list
Jun 25 22:36:46 localhost stonith-ng[3102]:  notice: watchdog can not fence (reboot) node-1: static-list
Jun 25 22:36:59 localhost stonith-ng[3102]:  notice: Operation 'off' [10653] (call 2 from stonith_admin.10640) for host 'node-1' with device 'MyStonith' returned: 0 (OK)
Jun 25 22:36:59 localhost stonith-ng[3102]:  notice: Operation off of node-1 by node-2 for stonith_admin.10640 at node-2.05923fc7: OK
Jun 25 22:37:00 localhost stonith-ng[3102]:  notice: Operation 'reboot' [10693] (call 4 from crmd.3106) for host 'node-1' with device 'MyStonith' returned: 0 (OK)
Jun 25 22:37:00 localhost stonith-ng[3102]:  notice: Operation reboot of node-1 by node-2 for crmd.3106 at node-2.f29ba740: OK
...
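
If I read this log right, my 'off' request from stonith_admin is accompanied
by a cluster-initiated 'reboot' request from crmd, and 'reboot' is Pacemaker's
default fencing action. As far as I understand, that default is controlled by
the stonith-action cluster property rather than by the action= attribute on
the device, so my next attempt (untested so far) will be something like:

    pcs property set stonith-action=off

possibly together with mapping reboot requests to 'off' at the device level:

    pcs stonith update MyStonith pcmk_reboot_action=off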

This may seem strange, but when sbd is not running on node-1 I consistently
get "(off)" instead of "(reboot)" in node-2:/var/log/messages after issuing:

[root at node-2 ~]# stonith_admin -F node-1

and in this case there is of course no response from node-1 to the fencing
request.


Cheers,

Marcin