[ClusterLabs] How to cancel a fencing request?

Mon Apr 2 14:02:24 UTC 2018

On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais wrote:
> On Sun, 1 Apr 2018 09:01:15 +0300
> Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> 
> > 31.03.2018 23:29, Jehan-Guillaume de Rorthais wrote:
> > > Hi all,
> > > 
> > > I experienced a problem in a two-node cluster. It has one fence
> > > agent (FA) per node, each with a location constraint keeping it
> > > off the node it is supposed to fence.
> > 
> > If you mean the stonith resource: as far as I know, a location
> > constraint does not affect stonith operations; it only changes
> > where the monitoring action is performed.
> 
> Sure.
> 
> > You can create two stonith resources and declare that each
> > can fence only a single node, but that is not a location
> > constraint, it is resource configuration. Showing your
> > configuration would be helpful to avoid guessing.
> 
> True, I should have done that. A conf is worth a thousand words :)
> 
>   crm conf<<EOC
> 
>   primitive fence_vm_srv1 stonith:fence_virsh                   \
>     params pcmk_host_check="static-list" pcmk_host_list="srv1"  \
>            ipaddr="192.168.2.1" login="<user>"                  \
>            identity_file="/root/.ssh/id_rsa"                    \
>            port="srv1-d8" action="off"                          \
>     op monitor interval=10s
> 
>   location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1
> 
>   primitive fence_vm_srv2 stonith:fence_virsh                   \
>     params pcmk_host_check="static-list" pcmk_host_list="srv2"  \
>            ipaddr="192.168.2.1" login="<user>"                  \
>            identity_file="/root/.ssh/id_rsa"                    \
>            port="srv2-d8" action="off"                          \
>     op monitor interval=10s
> 
>   location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2
>   
>   EOC
> 
> 
> > > During some tests, a ms resource raised an error during the stop
> > > action on
> > > both nodes. So both nodes were supposed to be fenced.
> > 
> > In a two-node cluster you can set pcmk_delay_max so that both
> > nodes do not attempt fencing simultaneously.
> 
> I'm not sure I understand the doc correctly regarding this
> property. Does pcmk_delay_max delay the request itself or the
> execution of the request?
> 
> In other words, is it:
> 
>   delay -> fence query -> fencing action
> 
> or 
> 
>   fence query -> delay -> fence action
> 
> ?
> 
> The first definition would solve this issue, but not the second. As I
> understand it, as soon as the fence query has been sent, the node
> status is
> "UNCLEAN (online)".

The latter -- you're correct, the node is already unclean by that time.
Since the stop did not succeed, the node must be fenced to continue
safely.
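
For reference, a minimal sketch of where pcmk_delay_max would go, reusing
the fence_vm_srv1 definition quoted above. The 15s value is only an
assumption for illustration, not a recommendation, and the same parameter
could be added to fence_vm_srv2. It enables a random delay of up to that
value before the fencing action runs; it does not keep the node from
being marked unclean.

```
primitive fence_vm_srv1 stonith:fence_virsh                   \
  params pcmk_host_check="static-list" pcmk_host_list="srv1"  \
         pcmk_delay_max="15s"                                 \
         ipaddr="192.168.2.1" login="<user>"                  \
         identity_file="/root/.ssh/id_rsa"                    \
         port="srv1-d8" action="off"                          \
  op monitor interval=10s
```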

> > > The first node did, but no FA was then able to fence the second
> > > one. So the
> > > node stayed DC and was reported as "UNCLEAN (online)".
> > > 
> > > We were able to fix the original resource problem, but not to
> > > avoid the useless fencing of the second node.
> > > 
> > > My questions are:
> > > 
> > > 1. is it possible to cancel the fencing request?
> > > 2. is it possible to reset the node status to "online"?
> > 
> > Not that I'm aware of.
> 
> Argh!
> 
> ++

You could fix the problem with the stopped service manually, then run
"stonith_admin --confirm=<NODENAME>" (or higher-level tool equivalent).
That tells the cluster that you took care of the issue yourself, so
fencing can be considered complete.

The catch there is that the cluster will assume you stopped the node,
and all services on it are stopped. That could potentially cause some
headaches if it's not true. I'm guessing that if you unmanaged all the
resources on it first, then confirmed fencing, the cluster would detect
everything properly, then you could re-manage.
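
The sequence above could be sketched roughly as follows. This is untested
against a real cluster and only illustrates the order of operations. The
run and confirm_fencing helpers and the DRY_RUN switch are made up for
this example, and the resource name in the usage line (pgsql-ha) is a
placeholder; only crm resource unmanage/manage and stonith_admin --confirm
are real commands. By default the commands are printed rather than
executed.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are hypothetical.
# With DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: confirm_fencing NODE RESOURCE...
confirm_fencing() {
    node="$1"
    shift

    # 1. Unmanage the resources so the cluster does not act on them.
    for rsc in "$@"; do
        run crm resource unmanage "$rsc"
    done

    # 2. Tell the cluster the node has been dealt with manually,
    #    so the pending fencing is considered complete.
    run stonith_admin --confirm="$node"

    # 3. Hand the resources back to the cluster. On a real cluster,
    #    check the detected state (e.g. with crm_mon) before this step.
    for rsc in "$@"; do
        run crm resource manage "$rsc"
    done
}
```

For example, "confirm_fencing srv2 pgsql-ha" prints the three commands
for node srv2; setting DRY_RUN=0 would actually run them, which should
only be done once the failed service has been fixed by hand.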
-- 
Ken Gaillot <kgaillot at redhat.com>