[ClusterLabs] DLM fencing

Jason Gauthier jagauthier at gmail.com
Thu May 24 20:57:41 EDT 2018


On Thu, May 24, 2018 at 10:40 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
> On Thu, 2018-05-24 at 16:14 +0200, Klaus Wenninger wrote:
>> On 05/24/2018 04:03 PM, Ken Gaillot wrote:
>> > On Thu, 2018-05-24 at 06:47 -0400, Jason Gauthier wrote:
>> > > On Thu, May 24, 2018 at 12:19 AM, Andrei Borzenkov
>> > > <arvidjaar at gmail.com> wrote:
>> > > > 24.05.2018 02:57, Jason Gauthier wrote:
>> > > > > I'm fairly new to clustering under Linux.  I basically have one
>> > > > > shared storage resource right now, using dlm and gfs2.
>> > > > > I'm using fibre channel, and when both of my nodes are up (2-node
>> > > > > cluster) dlm and gfs2 seem to be operating perfectly.
>> > > > > If I reboot node B, node A works fine and vice-versa.
>> > > > >
>> > > > > When node B goes offline unexpectedly and becomes unclean, dlm
>> > > > > seems to block all IO to the shared storage.
>> > > > >
>> > > > > dlm knows node B is down:
>> > > > >
>> > > > > # dlm_tool status
>> > > > > cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
>> > > > > daemon now 865695 fence_pid 18186
>> > > > > fence 1084772369 nodedown pid 18186 actor 1084772368 fail 1527119246 fence 0 now 1527119524
>> > > > > node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
>> > > > > node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0 0
>> > > > >
>> > > > > On the same server, I see these messages in my daemon.log:
>> > > > > May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could not kick (reboot) node 1084772369/(null) : No route to host (-113)
>> > > > > May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113 nodeid 1084772369
>> > > > >
>> > > > > I can recover from the situation by forcing it (or bringing the
>> > > > > other node back online):
>> > > > > dlm_tool fence_ack 1084772369
>> > > > >
>> > > > > Cluster config is pretty straightforward.
>> > > > > node 1084772368: alpha
>> > > > > node 1084772369: beta
>> > > > > primitive p_dlm_controld ocf:pacemaker:controld \
>> > > > >         op monitor interval=60 timeout=60 \
>> > > > >         meta target-role=Started \
>> > > > >         params args="-K -L -s 1"
>> > > > > primitive p_fs_gfs2 Filesystem \
>> > > > >         params device="/dev/sdb2" directory="/vms" fstype=gfs2
>> > > > > primitive stonith_sbd stonith:external/sbd \
>> > > > >         params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
>> > > > >         meta target-role=Started
>> > > >
>> > > > What is the status of stonith resource? Did you configure SBD
>> > > > fencing
>> > > > properly?
>> > >
>> > > I believe so.  It's shown above in my cluster config.
>> > >
>> > > > Is sbd daemon up and running with proper parameters?
>> > >
>> > > Well, no, apparently sbd isn't running.  With dlm and gfs2, the
>> > > cluster handles launching the daemons.
>> > > I assumed the same here, since the resource shows that it is up.
>> >
>> > Unlike other services, sbd must be up before the cluster starts in
>> > order for the cluster to use it properly. (Notice the
>> > "have-watchdog=false" in your cib-bootstrap-options ... that means
>> > the cluster didn't find sbd running.)
>> >
>> > Also, even storage-based sbd requires a working hardware watchdog
>> > for the actual self-fencing. SBD_WATCHDOG_DEV in /etc/sysconfig/sbd
>> > should list the watchdog device. Also, sbd_device in your cluster
>> > config should match SBD_DEVICE in /etc/sysconfig/sbd.
>> >
>> > If you want the cluster to recover services elsewhere after a node
>> > self-fences (which I'm sure you do), you also need to set the
>> > stonith-watchdog-timeout cluster property to something greater than
>> > the value of SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd. The cluster
>> > will wait that long and then assume the node fenced itself.

Thanks.  So, for whatever reason, sbd was not running. I went ahead
and got /etc/default/sbd (Debian) configured; the relevant bits are
sketched below.
I can't start the service manually due to dependencies, but I rebooted
node B and it came up.
Node A would not, so I ended up rebooting both nodes at the same time,
and then sbd was running on both.
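
For reference, the relevant part of my /etc/default/sbd now looks
roughly like this. SBD_DEVICE matches the sbd_device in the cluster
config; the watchdog device name and the timeout are assumptions for
illustration only, so adjust them to your hardware:

# shared-disk slot device used by sbd (same as sbd_device in the CIB)
SBD_DEVICE="/dev/sdb1"
# hardware watchdog device used for self-fencing (assumed name)
SBD_WATCHDOG_DEV="/dev/watchdog"
# watchdog timeout in seconds (example value)
SBD_WATCHDOG_TIMEOUT="5"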

I forced a failure of node B, and after a few seconds node A was able
to access the shared storage.
Definite improvement!
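
I still want to verify fencing end-to-end with stonith_admin, as
suggested earlier in the thread. This is roughly what I plan to try
from alpha (targeting beta is just an example, and I'm assuming
stonith_admin's --reboot option behaves as I expect):

# ask the cluster to fence the other node via the configured stonith resource
stonith_admin --reboot beta
# confirm dlm no longer reports a pending fence for that nodeid
dlm_tool status
# and check the message slots on the shared sbd device
sbd -d /dev/sdb1 list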



>> Actually, in the case where there is a shared disk, a successful
>> fencing attempt via the sbd fencing resource should be enough
>> for the node to be assumed down.
>> In a 2-node setup I would even discourage setting
>> stonith-watchdog-timeout, as we need a real quorum mechanism
>> for that to work.
>
> Ah, thanks -- I've updated the wiki how-to, feel free to clarify
> further:
>
> https://wiki.clusterlabs.org/wiki/Using_SBD_with_Pacemaker
>
>>
>> Regards,
>> Klaus
>>
>> >
>> > > Online: [ alpha beta ]
>> > >
>> > > Full list of resources:
>> > >
>> > >  stonith_sbd    (stonith:external/sbd): Started alpha
>> > >  Clone Set: cl_gfs2 [g_gfs2]
>> > >      Started: [ alpha beta ]
>> > >
>> > >
>> > > > What is output of
>> > > > sbd -d /dev/sdb1 dump
>> > > > sbd -d /dev/sdb1 list
>> > >
>> > > Both nodes seem fine.
>> > >
>> > > 0       alpha   test    beta
>> > > 1       beta    test    alpha
>> > >
>> > >
>> > > > on both nodes? Does
>> > > >
>> > > > sbd -d /dev/sdb1 message <other-node> test
>> > > >
>> > > > work in both directions?
>> > >
>> > > It doesn't return an error, yet without a daemon running, I don't
>> > > think the message is received either.
>> > >
>> > >
>> > > > Does manual fencing using stonith_admin work?
>> > >
>> > > I'm not sure at the moment.  I think I need to look into why the
>> > > daemon isn't running.
>> > >
>> > > > > group g_gfs2 p_dlm_controld p_fs_gfs2
>> > > > > clone cl_gfs2 g_gfs2 \
>> > > > >         meta interleave=true target-role=Started
>> > > > > location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
>> > > > > property cib-bootstrap-options: \
>> > > > >         have-watchdog=false \
>> > > > >         dc-version=1.1.16-94ff4df \
>> > > > >         cluster-infrastructure=corosync \
>> > > > >         cluster-name=zeta \
>> > > > >         last-lrm-refresh=1525523370 \
>> > > > >         stonith-enabled=true \
>> > > > >         stonith-timeout=20s
>> > > > >
>> > > > > Any pointers would be appreciated. I feel like this should be
>> > > > > working but
>> > > > > I'm not sure if I've missed something.
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Jason
>> > > > >
>> > > > >
>> > > > >
>>
>>
> --
> Ken Gaillot <kgaillot at redhat.com>


