[ClusterLabs] How to cancel a fencing request?

Ken Gaillot kgaillot at redhat.com
Tue Apr 3 22:35:43 UTC 2018


On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote:
> On 04/03/2018 05:43 PM, Ken Gaillot wrote:
> > On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote:
> > > On 04/02/2018 04:02 PM, Ken Gaillot wrote:
> > > > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais
> > > > wrote:
> > > > > On Sun, 1 Apr 2018 09:01:15 +0300
> > > > > Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> > > > > 
> > > > > > 31.03.2018 23:29, Jehan-Guillaume de Rorthais wrote:
> > > > > > > Hi all,
> > > > > > > 
> > > > > > > I experienced a problem in a two-node cluster. It has one FA
> > > > > > > per node and location constraints so that each FA avoids the
> > > > > > > node it is supposed to interrupt.
> > > > > > 
> > > > > > If you mean stonith resource - for all I know location does
> > > > > > not affect stonith operations and only changes where the
> > > > > > monitoring action is performed.
> > > > > 
> > > > > Sure.
> > > > > 
> > > > > > You can create two stonith resources and declare that each
> > > > > > can fence only a single node, but that is not a location
> > > > > > constraint, it is resource configuration. Showing your
> > > > > > configuration would be helpful to avoid guessing.
> > > > > 
> > > > > True, I should have done that. A conf is worth thousands of
> > > > > words :)
> > > > > 
> > > > >   crm conf<<EOC
> > > > > 
> > > > >   primitive fence_vm_srv1 stonith:fence_virsh                  \
> > > > >     params pcmk_host_check="static-list" pcmk_host_list="srv1" \
> > > > >            ipaddr="192.168.2.1" login="<user>"                 \
> > > > >            identity_file="/root/.ssh/id_rsa"                   \
> > > > >            port="srv1-d8" action="off"                         \
> > > > >     op monitor interval=10s
> > > > > 
> > > > >   location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1
> > > > > 
> > > > >   primitive fence_vm_srv2 stonith:fence_virsh                  \
> > > > >     params pcmk_host_check="static-list" pcmk_host_list="srv2" \
> > > > >            ipaddr="192.168.2.1" login="<user>"                 \
> > > > >            identity_file="/root/.ssh/id_rsa"                   \
> > > > >            port="srv2-d8" action="off"                         \
> > > > >     op monitor interval=10s
> > > > > 
> > > > >   location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2
> > > > > 
> > > > >   EOC
> > > > > 
> > > 
> > > -inf constraints like that should effectively prevent
> > > stonith-actions from being executed on those nodes.
> > 
> > It shouldn't ...
> > 
> > Pacemaker respects target-role=Started/Stopped for controlling
> > execution of fence devices, but location (or even whether the device
> > is "running" at all) only affects monitors, not execution.
> > 
> > > Though there are a few issues with location constraints
> > > and stonith-devices.
> > > 
> > > When stonithd brings up the devices from the CIB, it
> > > runs the parts of pengine that fully evaluate these
> > > constraints, and it would disable the stonith-device
> > > if the resource is unrunnable on that node.
> > 
> > That should be true only for target-role, not everything that
> > affects runnability.
> 
> cib_device_update bails out via a removal of the device if
> - role == stopped
> - node not in allowed_nodes-list of stonith-resource
> - weight is negative
> 
> Wouldn't that include a -inf rule for a node?

Well, I'll be ... I thought I understood what was going on there. :-)
You're right.

I've frequently seen it recommended to ban fence devices from their
target when using one device per target. Perhaps it would be better to
give a lower (but positive) score on the target compared to the other
node(s), so it can be used when no other nodes are available.
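
For example (a sketch based on the configuration posted earlier in the
thread; the score values are only illustrative), the -inf bans could be
replaced with something like:

  location fence_vm_srv1-prefers-srv2 fence_vm_srv1 100: srv2
  location fence_vm_srv1-on-srv1-last fence_vm_srv1  50: srv1
  location fence_vm_srv2-prefers-srv1 fence_vm_srv2 100: srv1
  location fence_vm_srv2-on-srv2-last fence_vm_srv2  50: sr2

That way each device normally runs away from its target, but it is not
removed outright on the target node, so it stays usable there when no
other node is available.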

> It is of course clear that no pengine-decision to start
> a stonith-resource is required for it to be used for
> fencing.
> 
> Regards,
> Klaus
> 
> > 
> > > But this part is not retriggered for location constraints
> > > with attributes or other content that would dynamically
> > > change. So one has to stick with constraints as simple
> > > and static as those in the example above.
> > > 
> > > Regarding adding/removing location constraints dynamically
> > > I remember a bug that should have got fixed around 1.1.18
> > > that led to improper handling, with stonith-devices that were
> > > disabled or banned from certain nodes actually being used.
> > > 
> > > Regards,
> > > Klaus
> > >  
> > > > > > > During some tests, a ms resource raised an error during the
> > > > > > > stop action on both nodes. So both nodes were supposed to be
> > > > > > > fenced.
> > > > > > 
> > > > > > In a two-node cluster you can set pcmk_delay_max so that both
> > > > > > nodes do not attempt fencing simultaneously.
> > > > > 
> > > > > I'm not sure I understand the doc correctly with regard to this
> > > > > property. Does pcmk_delay_max delay the request itself or the
> > > > > execution of the request?
> > > > > 
> > > > > In other words, is it:
> > > > > 
> > > > >   delay -> fence query -> fencing action
> > > > > 
> > > > > or 
> > > > > 
> > > > >   fence query -> delay -> fence action
> > > > > 
> > > > > ?
> > > > > 
> > > > > The first definition would solve this issue, but not the second.
> > > > > As I understand it, as soon as the fence query has been sent, the
> > > > > node status is "UNCLEAN (online)".
> > > > 
> > > > The latter -- you're correct, the node is already unclean by that
> > > > time. Since the stop did not succeed, the node must be fenced to
> > > > continue safely.
> > > 
> > > Well, pcmk_delay_base/max are made for the case
> > > where both nodes in a 2-node cluster lose contact
> > > and each sees the other as unclean.
> > > If the loser gets fenced, its view of the partner
> > > node becomes irrelevant.
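
For reference, a sketch of how such delays could be added to the two
devices from the configuration above (the 10s values are illustrative
only, not from the thread):

  # Option A: a fixed head start. Delaying the device that fences srv1
  # makes srv1 the likely survivor of a mutual fencing race.
  crm_resource --resource fence_vm_srv1 --set-parameter pcmk_delay_base \
               --parameter-value 10s

  # Option B: a random delay on both devices, so that simultaneous
  # requests are merely unlikely to execute at the same moment.
  crm_resource --resource fence_vm_srv1 --set-parameter pcmk_delay_max \
               --parameter-value 10s
  crm_resource --resource fence_vm_srv2 --set-parameter pcmk_delay_max \
               --parameter-value 10s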
> > > 
> > > > > > > The first node did, but no FA was then able to fence the
> > > > > > > second one. So the node stayed DC and was reported as
> > > > > > > "UNCLEAN (online)".
> > > > > > > 
> > > > > > > We were able to fix the original resource problem, but not
> > > > > > > to avoid the useless fencing of the second node.
> > > > > > > 
> > > > > > > My questions are:
> > > > > > > 
> > > > > > > 1. is it possible to cancel the fencing request?
> > > > > > > 2. is it possible to reset the node status to "online"?
> > > > > > 
> > > > > > Not that I'm aware of.
> > > > > 
> > > > > Argh!
> > > > > 
> > > > > ++
> > > > 
> > > > You could fix the problem with the stopped service manually, then
> > > > run "stonith_admin --confirm=<NODENAME>" (or higher-level tool
> > > > equivalent). That tells the cluster that you took care of the
> > > > issue yourself, so fencing can be considered complete.
> > > > 
> > > > The catch there is that the cluster will assume you stopped the
> > > > node, and all services on it are stopped. That could potentially
> > > > cause some headaches if it's not true. I'm guessing that if you
> > > > unmanaged all the resources on it first, then confirmed fencing,
> > > > the cluster would detect everything properly, then you could
> > > > re-manage.
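
For what it's worth, that sequence might look roughly like the
following (a sketch only; "my_ms" is a placeholder for the affected ms
resource, which is not named in the thread, and the caveats above
still apply):

  # 1. take the affected resource(s) out of cluster management so the
  #    cluster does not react while you intervene
  crm resource unmanage my_ms

  # 2. fix the underlying problem by hand, then tell the cluster the
  #    pending fencing of the unclean node can be considered complete
  stonith_admin --confirm=<NODENAME>

  # 3. once the status looks sane again, hand control back
  crm resource manage my_ms
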
-- 
Ken Gaillot <kgaillot at redhat.com>

