[ClusterLabs] STONITH when both IB interfaces are down, and how to trigger Filesystem mount/umount failure to test STONITH?

Marcin Dulak marcin.dulak at gmail.com
Thu Aug 20 10:46:00 UTC 2015


Hi, thanks for the answers,

I've performed the test of shutting down both IPoIB interfaces on an OSS
server while a Lustre client was writing a large file to an OST on that
server: the umount still succeeded, and writing to the file continued
after a short delay on the same OST mounted on the failed-over server.
I found, however, that if one incorrectly formats a Lustre OST (wrong index)
then it fails to mount and STONITH is triggered. I may test the
"exit $OCF_ERR_GENERIC" solution, but I would like to go back now to the
first question: how can one trigger STONITH when a server loses both IB
interfaces? How can that be made to cooperate with the existing
Filesystem-mount-based STONITH? Is it a good idea at all? Are there any
examples on the net?
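
To make the question concrete: the closest I came up with myself is to clone
a small link-checking resource on both servers and put on-fail=fence on its
monitor operation, roughly like this (untested sketch; ocf:local:ib-link is
a custom agent I would still have to write, see further down in the thread):

primitive p-ib-link ocf:local:ib-link \
   params interfaces="ib0 ib1" \
   op monitor interval=30s timeout=30s on-fail=fence \
   op start   interval=0   timeout=60s \
   op stop    interval=0   timeout=60s
clone cl-ib-link p-ib-link

The idea being that the monitor returns an error once both interfaces are
down, so the node gets fenced independently of the Filesystem resources,
but I don't know whether this is a sane way to combine the two.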

Marcin


On Thu, Aug 20, 2015 at 9:00 AM, Andrei Borzenkov <arvidjaar at gmail.com>
wrote:

> 19.08.2015 13:31, Marcin Dulak wrote:
> > However if instead both IPoIB interfaces go down on server-02,
> > the mdt is moved to server-01, but no STONITH is performed on server-02.
> > This is expected, because there is nothing in the configuration that
> > triggers
> > STONITH in case of IB connection loss.
> > However, if IPoIB is flapping, this setup could lead to the mdt moving
> > back and forth between server-01 and server-02.
> > Should I have STONITH shut down a node that loses both IPoIB interfaces
> > (remember they are passively redundant, only one active at a time)?
>
> It is really up to the agent. Note that on-fail is triggered only if the
> operation fails. So as long as the stop invocation does not return an
> error, no fencing happens.
>
> > If so, how to achieve that?
> >
>
> If you really want to trigger fencing when access to the block device
> fails, you probably need to define it as a separate resource with its own
> agent and set on-fail=fence on the monitor operation for this block
> device. Otherwise you cannot really distinguish a filesystem-level error
> from a block-device-level one.
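
A minimal sketch of what the monitor of such a custom agent could look like
(everything below is made up by me and untested: the agent name, the device
parameter and the function name; meta-data, start and stop are omitted).
The same pattern would apply to the ib-link idea above:

#!/bin/sh
# hypothetical ocf:local:blockdev-check agent (incomplete sketch)
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

blockdev_check_monitor() {
    # report "not running" until start has created the state file,
    # so that the initial probe does not immediately fail
    [ -e "${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.state" ] || return $OCF_NOT_RUNNING
    # try a direct read of the first 4k of the device, bypassing the page cache
    if dd if="${OCF_RESKEY_device}" of=/dev/null bs=4k count=1 iflag=direct \
        >/dev/null 2>&1; then
        return $OCF_SUCCESS
    fi
    ocf_log err "cannot read from ${OCF_RESKEY_device}"
    return $OCF_ERR_GENERIC  # with on-fail=fence on the monitor op, this fences the node
}

and the corresponding primitive would then be something like:

primitive p-ost1-blockdev ocf:local:blockdev-check \
   params device=/dev/mapper/ost1 \
   op monitor interval=60s timeout=60s on-fail=fence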
>
> > The context for the second question: the configuration contains the
> > following Filesystem template:
> >
> > rsc_template lustre-target-template ocf:heartbeat:Filesystem \
> >    op monitor interval=120 timeout=60 OCF_CHECK_LEVEL=10 \
> >    op start   interval=0   timeout=300 on-fail=fence \
> >    op stop    interval=0   timeout=300 on-fail=fence
> >
> > How can I make the umount/mount of the Filesystem fail in order to test
> > the STONITH action in these cases?
> >
>
> Insert "exit $OCF_ERR_GENERIC" in stop method? :)
>
> > Extra question: where can I find the documentation/source what
> > on-fail=fence is doing?
>
> Pacemaker Explained has some description. It should initiate fencing
> of the node where the resource had been active.
>
> > Or what does on-fail=stop mean in the ethmonitor template below (what is
> > stopped?)?
> >
>
> on-fail=stop sets the resource's target role to Stopped, so Pacemaker
> tries to stop it and leave it stopped.
>
> > rsc_template netmonitor-30sec ethmonitor \
> >    params repeat_count=3 repeat_interval=10 \
> >    op monitor interval=15s timeout=60s \
> >    op start   interval=0s  timeout=60s on-fail=stop \
> >
> > Marcin
> >
> >
> >
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>