[ClusterLabs] DLM fencing

Andrei Borzenkov arvidjaar at gmail.com
Thu May 24 04:19:29 UTC 2018


On 24.05.2018 02:57, Jason Gauthier wrote:
> I'm fairly new to clustering under Linux.  I basically have one shared
> storage resource right now, using dlm and gfs2.
> I'm using fibre channel and when both of my nodes are up (2 node cluster)
> dlm and gfs2 seem to be operating perfectly.
> If I reboot node B, node A works fine and vice-versa.
> 
> When node B goes offline unexpectedly and becomes unclean, dlm seems to
> block all IO to the shared storage.
> 
> dlm knows node B is down:
> 
> # dlm_tool status
> cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
> daemon now 865695 fence_pid 18186
> fence 1084772369 nodedown pid 18186 actor 1084772368 fail 1527119246 fence 0 now 1527119524
> node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
> node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0 0
> 
> on the same server, I see these messages in my daemon.log
> May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could not kick (reboot) node 1084772369/(null) : No route to host (-113)
> May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113 nodeid 1084772369
> 
> I can recover from the situation by forcing it (or by bringing the other
> node back online):
> dlm_tool fence_ack 1084772369
> 
> The cluster config is pretty straightforward:
> node 1084772368: alpha
> node 1084772369: beta
> primitive p_dlm_controld ocf:pacemaker:controld \
>         op monitor interval=60 timeout=60 \
>         meta target-role=Started \
>         params args="-K -L -s 1"
> primitive p_fs_gfs2 Filesystem \
>         params device="/dev/sdb2" directory="/vms" fstype=gfs2
> primitive stonith_sbd stonith:external/sbd \
>         params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
>         meta target-role=Started

What is the status of the stonith resource? Did you configure SBD fencing
properly? Is the sbd daemon up and running with the proper parameters? What
is the output of

sbd -d /dev/sdb1 dump
sbd -d /dev/sdb1 list

on both nodes? Does

sbd -d /dev/sdb1 message <other-node> test

work in both directions?
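
To answer the first questions quickly, something along these lines on each
node should show whether the stonith resource is started and whether sbd is
watching the right device (a rough sketch; the systemd unit name and the
sysconfig path are assumptions and differ between distributions):

crm_mon -1 | grep stonith_sbd        # is the stonith resource Started, and on which node?
ps -eo args | grep '[s]bd'           # is the sbd daemon running, and with which -d device?
systemctl status sbd                 # assuming your distribution ships an sbd.service unit
grep SBD_DEVICE /etc/sysconfig/sbd   # or /etc/default/sbd on Debian-based systems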

Does manual fencing using stonith_admin work?
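
For example, something like this (beta is your second node; note that a
successful --reboot will really power-cycle it):

stonith_admin --reboot beta
stonith_admin --history beta         # result of the last fencing attempt for that node

If that fails too, the problem is most likely in the SBD/stonith setup itself
rather than in DLM.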

> group g_gfs2 p_dlm_controld p_fs_gfs2
> clone cl_gfs2 g_gfs2 \
>         meta interleave=true target-role=Started
> location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
> property cib-bootstrap-options: \
>         have-watchdog=false \
>         dc-version=1.1.16-94ff4df \
>         cluster-infrastructure=corosync \
>         cluster-name=zeta \
>         last-lrm-refresh=1525523370 \
>         stonith-enabled=true \
>         stonith-timeout=20s
> 
> Any pointers would be appreciated. I feel like this should be working but
> I'm not sure if I've missed something.
> 
> Thanks,
> 
> Jason
> 


