[ClusterLabs] DLM fencing

Klaus Wenninger kwenning at redhat.com
Thu May 24 14:14:45 UTC 2018


On 05/24/2018 04:03 PM, Ken Gaillot wrote:
> On Thu, 2018-05-24 at 06:47 -0400, Jason Gauthier wrote:
>> On Thu, May 24, 2018 at 12:19 AM, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>>> On 24.05.2018 02:57, Jason Gauthier wrote:
>>>> I'm fairly new to clustering under Linux.  I basically have one
>>>> shared storage resource right now, using dlm and gfs2.
>>>> I'm using fibre channel, and when both of my nodes are up (2-node
>>>> cluster) dlm and gfs2 seem to be operating perfectly.
>>>> If I reboot node B, node A works fine and vice-versa.
>>>>
>>>> When node B goes offline unexpectedly and becomes unclean, dlm
>>>> seems to block all I/O to the shared storage.
>>>>
>>>> dlm knows node B is down:
>>>>
>>>> # dlm_tool status
>>>> cluster nodeid 1084772368 quorate 1 ring seq 32644 32644
>>>> daemon now 865695 fence_pid 18186
>>>> fence 1084772369 nodedown pid 18186 actor 1084772368 fail 1527119246 fence 0 now 1527119524
>>>> node 1084772368 M add 861439 rem 0 fail 0 fence 0 at 0 0
>>>> node 1084772369 X add 865239 rem 865416 fail 865416 fence 0 at 0 0
>>>>
>>>> On the same server, I see these messages in my daemon.log:
>>>> May 23 19:52:47 alpha stonith-api[18186]: stonith_api_kick: Could not kick (reboot) node 1084772369/(null) : No route to host (-113)
>>>> May 23 19:52:47 alpha dlm_stonith[18186]: kick_helper error -113 nodeid 1084772369
>>>>
>>>> I can recover from the situation by forcing it (or bringing the
>>>> other node back online):
>>>> dlm_tool fence_ack 1084772369
>>>>
>>>> Cluster config is pretty straightforward.
>>>> node 1084772368: alpha
>>>> node 1084772369: beta
>>>> primitive p_dlm_controld ocf:pacemaker:controld \
>>>>         op monitor interval=60 timeout=60 \
>>>>         meta target-role=Started \
>>>>         params args="-K -L -s 1"
>>>> primitive p_fs_gfs2 Filesystem \
>>>>         params device="/dev/sdb2" directory="/vms" fstype=gfs2
>>>> primitive stonith_sbd stonith:external/sbd \
>>>>         params pcmk_delay_max=30 sbd_device="/dev/sdb1" \
>>>>         meta target-role=Started
>>> What is the status of the stonith resource? Did you configure SBD
>>> fencing properly?
>> I believe so.  It's shown above in my cluster config.
>>
>>> Is sbd daemon up and running with proper parameters?
>> Well, no, apparently sbd isn't running.  With dlm and gfs2, the
>> cluster handles launching the daemons.
>> I assumed the same here, since the resource shows that it is up.
> Unlike other services, sbd must be up before the cluster starts in
> order for the cluster to use it properly. (Notice the
> "have-watchdog=false" in your cib-bootstrap-options ... that means the
> cluster didn't find sbd running.)
>
> Also, even storage-based sbd requires a working hardware watchdog for
> the actual self-fencing. SBD_WATCHDOG_DEV in /etc/sysconfig/sbd should
> list the watchdog device. Also, sbd_device in your cluster config
> should match SBD_DEVICE in /etc/sysconfig/sbd.
>
> If you want the cluster to recover services elsewhere after a node
> self-fences (which I'm sure you do), you also need to set the
> stonith-watchdog-timeout cluster property to something greater than
> the value of SBD_WATCHDOG_TIMEOUT in /etc/sysconfig/sbd. The cluster
> will wait that long and then assume the node fenced itself.

Actually, for the case where there is a shared disk, a successful
fencing attempt via the sbd fencing resource should be enough
for the node to be assumed down.
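For that, sbd of course has to actually be running on both nodes in
the first place, as Ken points out. As a rough sketch (device and
timeout values here are just the ones from this thread, and the path
may be /etc/default/sbd depending on the distribution),
/etc/sysconfig/sbd would contain something like:

    SBD_DEVICE="/dev/sdb1"
    SBD_WATCHDOG_DEV="/dev/watchdog"
    SBD_WATCHDOG_TIMEOUT="5"

and the sbd service has to be enabled so it is brought up together
with the cluster stack (a restart of the cluster stack on that node is
needed for it to take effect):

    systemctl enable sbd
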
In a 2-node setup I would even discourage setting
stonith-watchdog-timeout, as we need a real quorum mechanism
for that to work.
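If you do use watchdog-based recovery on a cluster that has real
quorum, the property would be set to something larger than
SBD_WATCHDOG_TIMEOUT, e.g. with crmsh:

    crm configure property stonith-watchdog-timeout=10s

And once sbd is actually up, manual fencing can be verified with
something like:

    stonith_admin --reboot beta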

Regards,
Klaus
 
>
>> Online: [ alpha beta ]
>>
>> Full list of resources:
>>
>>  stonith_sbd    (stonith:external/sbd): Started alpha
>>  Clone Set: cl_gfs2 [g_gfs2]
>>      Started: [ alpha beta ]
>>
>>
>>> What is output of
>>> sbd -d /dev/sdb1 dump
>>> sbd -d /dev/sdb1 list
>> Both nodes seem fine.
>>
>> 0       alpha   test    beta
>> 1       beta    test    alpha
>>
>>
>>> on both nodes? Does
>>>
>>> sbd -d /dev/sdb1 message <other-node> test
>>>
>>> work in both directions?
>> It doesn't return an error, yet without a daemon running, I don't
>> think the message is received either.
>>
>>
>>> Does manual fencing using stonith_admin work?
>> I'm not sure at the moment.  I think I need to look into why the
>> daemon isn't running.
>>
>>>> group g_gfs2 p_dlm_controld p_fs_gfs2
>>>> clone cl_gfs2 g_gfs2 \
>>>>         meta interleave=true target-role=Started
>>>> location cli-prefer-cl_gfs2 cl_gfs2 role=Started inf: alpha
>>>> property cib-bootstrap-options: \
>>>>         have-watchdog=false \
>>>>         dc-version=1.1.16-94ff4df \
>>>>         cluster-infrastructure=corosync \
>>>>         cluster-name=zeta \
>>>>         last-lrm-refresh=1525523370 \
>>>>         stonith-enabled=true \
>>>>         stonith-timeout=20s
>>>>
>>>> Any pointers would be appreciated. I feel like this should be
>>>> working but
>>>> I'm not sure if I've missed something.
>>>>
>>>> Thanks,
>>>>
>>>> Jason