[ClusterLabs] DLM fencing

Ken Gaillot kgaillot at redhat.com
Thu May 24 14:03:04 UTC 2018


On Thu, 2018-05-24 at 06:47 -0400, Jason Gauthier wrote:
> On Thu, May 24, 2018 at 12:19 AM, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> > On 24.05.2018 02:57, Jason Gauthier wrote:
> > > I'm fairly new to clustering under Linux.  I basically have one
> > > shared storage resource right now, using dlm and gfs2.
> > > I'm using fibre channel, and when both of my nodes are up (2-node
> > > cluster) dlm and gfs2 seem to be operating perfectly.
> > > If I reboot node B, node A works fine and vice-versa.
> > > 
> > > When node B goes offline unexpectedly and becomes unclean, dlm
> > > seems to block all I/O to the shared storage.
> > > 
> > > dlm knows node B is down:
> > > 
> > > # dlm_tool status
> > > cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
> > > daemon now 865695 fence_pid 18186
> > > fence 1084772369 nodedown pid 18186 actor 1084772368 fail 1527119246 fence 0 now 1527119524
> > > node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
> > > node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0 0
> > > 
> > > On the same server, I see these messages in my daemon.log:
> > > May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could not kick (reboot) node 1084772369/(null) : No route to host (-113)
> > > May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113 nodeid 1084772369
> > > 
> > > I can recover from the situation by forcing it (or bringing the
> > > other node back online):
> > > dlm_tool fence_ack 1084772369
> > > 
> > > The cluster config is pretty straightforward.
> > > node 1084772368: alpha
> > > node 1084772369: beta
> > > primitive p_dlm_controld ocf:pacemaker:controld \
> > >         op monitor interval=60 timeout=60 \
> > >         meta target-role=Started \
> > >         params args="-K -L -s 1"
> > > primitive p_fs_gfs2 Filesystem \
> > >         params device="/dev/sdb2" directory="/vms" fstype=gfs2
> > > primitive stonith_sbd stonith:external/sbd \
> > >         params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
> > >         meta target-role=Started
> > 
> > What is the status of the stonith resource? Did you configure SBD
> > fencing properly?
> 
> I believe so.  It's shown above in my cluster config.
> 
> > Is the sbd daemon up and running with proper parameters?
> 
> Well, no, apparently sbd isn't running.  With dlm and gfs2, the
> cluster handles launching the daemons.
> I assumed the same here, since the resource shows that it is up.

Unlike other services, sbd must be up before the cluster starts in
order for the cluster to use it properly. (Notice the
"have-watchdog=false" in your cib-bootstrap-options ... that means the
cluster didn't find sbd running.)
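
As a rough sketch only (this assumes a systemd-based distro where the
sbd package ships an sbd.service unit; package and unit names may
differ on your platform), you would enable sbd and then restart the
cluster services on that node so sbd comes up with them:

    # enable sbd so it starts and stops together with the cluster stack
    systemctl enable sbd
    # then stop and start pacemaker/corosync on this node, and confirm
    # the sbd watcher processes are actually running:
    ps aux | grep '[s]bd'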

Also, even storage-based sbd requires a working hardware watchdog for
the actual self-fencing. SBD_WATCHDOG_DEV in /etc/sysconfig/sbd should
list the watchdog device, and sbd_device in your cluster config should
match SBD_DEVICE in /etc/sysconfig/sbd.
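
As an illustration (the values below are placeholders, not a
recommendation), /etc/sysconfig/sbd (or /etc/default/sbd on
Debian-based systems) would end up looking something like:

    # illustrative values; adjust device names and timeouts for your setup
    SBD_DEVICE="/dev/sdb1"            # must match sbd_device in the stonith resource
    SBD_WATCHDOG_DEV="/dev/watchdog"  # hardware watchdog ("sbd query-watchdog" lists candidates)
    SBD_WATCHDOG_TIMEOUT="5"          # seconds; keep stonith-watchdog-timeout above this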

If you want the cluster to recover services elsewhere after a node
self-fences (which I'm sure you do), you also need to set the
stonith-watchdog-timeout cluster property to something greater than
the value of SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd. The cluster
will wait that long and then assume the node fenced itself.
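
With crmsh (which your config dump suggests you're using), that is
roughly the following; a sketch only, assuming SBD_WATCHDOG_TIMEOUT=5,
so pick values that fit your hardware:

    # stonith-watchdog-timeout should exceed SBD_WATCHDOG_TIMEOUT
    crm configure property stonith-watchdog-timeout=10s
    # verify; have-watchdog should flip to true once sbd is actually running
    crm configure show cib-bootstrap-options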

> 
> Online: [ alpha beta ]
> 
> Full list of resources:
> 
>  stonith_sbd    (stonith:external/sbd): Started alpha
>  Clone Set: cl_gfs2 [g_gfs2]
>      Started: [ alpha beta ]
> 
> 
> > What is output of
> > sbd -d /dev/sdb1 dump
> > sbd -d /dev/sdb1 list
> 
> Both nodes seem fine.
> 
> 0       alpha   test    beta
> 1       beta    test    alpha
> 
> 
> > on both nodes? Does
> > 
> > sbd -d /dev/sdb1 message <other-node> test
> > 
> > work in both directions?
> 
> It doesn't return an error, yet without a daemon running, I don't
> think the message is received either.
> 
> 
> > Does manual fencing using stonith_admin work?
> 
> I'm not sure at the moment.  I think I need to look into why the
> daemon isn't running.
> 
> > > group g_gfs2 p_dlm_controld p_fs_gfs2
> > > clone cl_gfs2 g_gfs2 \
> > >         meta interleave=true target-role=Started
> > > location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
> > > property cib-bootstrap-options: \
> > >         have-watchdog=false \
> > >         dc-version=1.1.16-94ff4df \
> > >         cluster-infrastructure=corosync \
> > >         cluster-name=zeta \
> > >         last-lrm-refresh=1525523370 \
> > >         stonith-enabled=true \
> > >         stonith-timeout=20s
> > > 
> > > Any pointers would be appreciated. I feel like this should be
> > > working, but I'm not sure if I've missed something.
> > > 
> > > Thanks,
> > > 
> > > Jason
> > > 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
-- 
Ken Gaillot <kgaillot at redhat.com>

