[ClusterLabs] STONITH when both IB interfaces are down, and how to trigger Filesystem mount/umount failure to test STONITH?

Andrei Borzenkov arvidjaar at gmail.com
Thu Aug 20 03:00:32 EDT 2015

On 19.08.2015 13:31, Marcin Dulak wrote:
> However if instead both IPoIB interfaces go down on server-02,
> the mdt is moved to server-01, but no STONITH is performed on server-02.
> This is expected, because there is nothing in the configuration that
> triggers STONITH in case of IB connection loss.
> However, if IPoIB is flapping, this setup could lead to the mdt moving
> back and forth between server-01 and server-02.
> Should I have STONITH shut down a node that loses both IPoIB
> (remember they are passively redundant, only one active at a time)
> interfaces?

It is really up to the agent. Note that on-fail is triggered only if
the operation fails. So as long as the stop invocation does not return
an error, no fencing happens.

> If so, how to achieve that?

If you really want to trigger fencing when access to the block device
fails, you probably need to define it as a separate resource with its
own agent and set on-fail=fence on the monitor operation for this block
device. Otherwise you cannot really distinguish a filesystem-level
error from a block-device-level one.
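
As a rough sketch of what the monitor of such an agent could look
like (this is not a shipped agent; the function name, the resource
name, the "custom" provider and the device path are all made up for
illustration, and a real agent would also need start, stop and
meta-data actions):

```shell
#!/bin/sh
# Hypothetical monitor for a custom block-device agent (illustration only).
# Return codes follow the OCF convention: 0 = success, 1 = generic error.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

blockdev_monitor() {
    dev="$1"
    # Try to read one block from the device. A failed read is reported as a
    # monitor failure. Note: a cached read may still succeed after the link
    # is lost, so a real check may need direct I/O (assumption, not tested).
    if dd if="$dev" of=/dev/null bs=4096 count=1 2>/dev/null; then
        return $OCF_SUCCESS
    fi
    return $OCF_ERR_GENERIC
}

# Possible crm configuration for it (all names are hypothetical):
#   primitive mdt-blockdev ocf:custom:blockdev \
#       params device=/dev/mapper/mdt \
#       op monitor interval=30 timeout=60 on-fail=fence
```

With something like this, a monitor failure on the block device
resource requests fencing on its own, independently of what the
Filesystem agent reports.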

> The context for the second question: the configuration contains the
> following Filesystem template:
> rsc_template lustre-target-template ocf:heartbeat:Filesystem \
>    op monitor interval=120 timeout=60 OCF_CHECK_LEVEL=10 \
>    op start   interval=0   timeout=300 on-fail=fence \
>    op stop    interval=0   timeout=300 on-fail=fence
> How can I make umount/mount of Filesystem fail in order to test STONITH
> action in these cases?

Insert "exit $OCF_ERR_GENERIC" in the stop method? :)
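
A slightly more controllable variant, sketched under assumptions: on
a test copy of the agent, make stop fail only while a marker file
exists, so the failure can be switched on for one node and removed
afterwards. The marker path and the function names below are
hypothetical, and the sketch uses "return" instead of "exit" only so
that it can be tried outside a real agent.

```shell
#!/bin/sh
# OCF return codes: 0 = success, 1 = generic error.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical marker file; create it on the node where stop should fail.
FAIL_MARKER="${FAIL_MARKER:-/tmp/force-stop-failure}"

# Stand-in for the agent's real stop logic (a real agent would unmount here).
real_stop() {
    return $OCF_SUCCESS
}

demo_stop() {
    if [ -e "$FAIL_MARKER" ]; then
        # Injected failure: with on-fail=fence on the stop operation, a
        # failed stop should lead to the node being fenced.
        return $OCF_ERR_GENERIC
    fi
    real_stop
}
```

Do this only on a test cluster, and remove the marker file and the
modified agent once the test is done.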

> Extra question: where can I find the documentation/source what
> on-fail=fence is doing?

Pacemaker Explained has some description. It should initiate fencing
of the node where the resource had been active.

> Or what does it mean on-fail=stop in the ethmonitor template below (what is
> stopped?)?

on-fail=stop sets the resource's target role to stopped. So Pacemaker
tries to stop it and leaves it stopped.

> rsc_template netmonitor-30sec ethmonitor \
>    params repeat_count=3 repeat_interval=10 \
>    op monitor interval=15s timeout=60s \
>    op start   interval=0s  timeout=60s on-fail=stop \
> Marcin
