[ClusterLabs] How to cancel a fencing request?
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Tue Apr 3 15:33:53 EDT 2018
On Mon, 02 Apr 2018 09:02:24 -0500
Ken Gaillot <kgaillot at redhat.com> wrote:
> On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais wrote:
> > On Sun, 1 Apr 2018 09:01:15 +0300
> > Andrei Borzenkov <arvidjaar at gmail.com> wrote:
[...]
> > > In a two-node cluster you can set pcmk_delay_max so that both nodes
> > > do not attempt fencing simultaneously.
> >
> > I'm not sure I understand the doc correctly with regard to this
> > property. Does pcmk_delay_max delay the request itself or the
> > execution of the request?
> >
> > In other words, is it:
> >
> > delay -> fence query -> fencing action
> >
> > or
> >
> > fence query -> delay -> fence action
> >
> > ?
> >
> > The first definition would solve this issue, but not the second. As I
> > understand it, as soon as the fence query has been sent, the node
> > status is
> > "UNCLEAN (online)".
>
> The latter -- you're correct, the node is already unclean by that time.
> Since the stop did not succeed, the node must be fenced to continue
> safely.
Thank you for this clarification.
Do you want a patch to add this clarification to the documentation?
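For anyone finding this thread later: if I read the docs correctly, a minimal
sketch of setting such a random delay on a fence device with pcs would look
something like this (the device name "fence-node1" and the 15 second cap are
just placeholders):

  pcs stonith update fence-node1 pcmk_delay_max=15s

The device should then wait a random delay (up to 15s here) before executing
the fence action, which helps avoid both nodes of a two-node cluster shooting
each other at the same time.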
> > > > The first node did, but no FA was then able to fence the second
> > > > one. So the
> > > > node stayed DC and was reported as "UNCLEAN (online)".
> > > >
> > > > We were able to fix the original resource problem, but not to
> > > > avoid the useless fencing of the second node.
> > > >
> > > > My questions are:
> > > >
> > > > 1. is it possible to cancel the fencing request?
> > > > 2. is it possible to reset the node status to "online"?
> > >
> > > Not that I'm aware of.
> >
> > Argh!
> >
> > ++
>
> You could fix the problem with the stopped service manually, then run
> "stonith_admin --confirm=<NODENAME>" (or higher-level tool equivalent).
> That tells the cluster that you took care of the issue yourself, so
> fencing can be considered complete.
Oh, OK. I was wondering if it could help.
For the complete story: while I was working on this cluster, we first tried to
"unfence" the node using "stonith_admin --unfence <nodename>"... and it actually
rebooted the node (using fence_vmware_soap) without cleaning its status??
...So we cleaned the status using "--confirm" after the complete reboot.
Thank you for this clarification again.
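In case it is useful to someone else hitting the same situation, the command
that finally cleaned the node status for us boils down to this (node name is a
placeholder):

  stonith_admin --confirm=<nodename>

and, if I'm not mistaken, the higher-level tools have an equivalent such as
"pcs stonith confirm <nodename>".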
> The catch there is that the cluster will assume you stopped the node,
> and all services on it are stopped. That could potentially cause some
> headaches if it's not true. I'm guessing that if you unmanaged all the
> resources on it first, then confirmed fencing, the cluster would detect
> everything properly, then you could re-manage.
Good to know. Thanks again.
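For the record, a rough sketch of that sequence as I understand it (untested;
resource and node names are placeholders):

  # for each resource still running on the node to be confirmed:
  pcs resource unmanage <resource>
  # tell the cluster the node has been dealt with manually:
  stonith_admin --confirm=<nodename>
  # once the cluster has re-probed and sees the real state, hand control back:
  pcs resource manage <resource>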