[ClusterLabs] DLM fencing

Jason Gauthier jagauthier at gmail.com
Wed May 23 19:57:53 EDT 2018


I'm fairly new to clustering under Linux.  I basically have one shared
storage resource right now, using DLM and GFS2.
I'm using fibre channel, and when both of my nodes are up (2-node cluster)
dlm and gfs2 seem to be operating perfectly.
If I reboot node B, node A works fine and vice-versa.

When node B goes offline unexpectedly and becomes unclean, dlm seems to
block all IO to the shared storage.

dlm knows node B is down:

# dlm_tool status
cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
daemon now 865695 fence_pid 18186
fence 1084772369 nodedown pid 18186 actor 1084772368 fail 1527119246 fence
0 now 1527119524
node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0 0

on the same server, I see these messages in my daemon.log
May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could not kick
(reboot) node 1084772369/(null) : No route to host (-113)
May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113 nodeid
1084772369

I can recover from the situation by manually acknowledging the fencing
(or by bringing the other node back online):
dlm_tool fence_ack 1084772369
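The `stonith_api_kick` error above (note the `(null)` node name and the
-113 / "No route to host" result) suggests it may be worth testing whether
Pacemaker can fence the peer on its own, independent of dlm_controld. A
sketch using the standard `stonith_admin` tool (the node name "beta" is
taken from the config below; adjust to your cluster):

```shell
# List the stonith devices Pacemaker has registered
stonith_admin --list-registered

# Ask Pacemaker to fence node "beta" directly; if this also fails,
# the problem is likely in the stonith configuration rather than in dlm
stonith_admin --reboot beta
```

If the direct `stonith_admin --reboot` succeeds while dlm's own kick
fails, that points at how dlm_controld maps nodeids to node names rather
than at the fencing device itself.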

The cluster config is pretty straightforward:
node 1084772368: alpha
node 1084772369: beta
primitive p_dlm_controld ocf:pacemaker:controld \
        op monitor interval=60 timeout=60 \
        meta target-role=Started \
        params args="-K -L -s 1"
primitive p_fs_gfs2 Filesystem \
        params device="/dev/sdb2" directory="/vms" fstype=gfs2
primitive stonith_sbd stonith:external/sbd \
        params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
        meta target-role=Started
group g_gfs2 p_dlm_controld p_fs_gfs2
clone cl_gfs2 g_gfs2 \
        meta interleave=true target-role=Started
location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.16-94ff4df \
        cluster-infrastructure=corosync \
        cluster-name=zeta \
        last-lrm-refresh=1525523370 \
        stonith-enabled=true \
        stonith-timeout=20s
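Since fencing here goes through external/sbd, another thing worth checking
is that the sbd partition itself is healthy and that a node can deliver a
message to its peer's slot. A sketch, assuming the `/dev/sdb1` device and
node name from the config above (the message command really will reset the
target node, so only run it against a node you intend to fence):

```shell
# Inspect the message slots on the shared sbd partition
sbd -d /dev/sdb1 list

# Send a reset message to the peer's slot (this WILL reboot beta)
sbd -d /dev/sdb1 message beta reset
```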

Any pointers would be appreciated. I feel like this should be working, but
I'm not sure what I've missed.

Thanks,

Jason
