[ClusterLabs] How to cancel a fencing request?

Klaus Wenninger kwenning at redhat.com
Tue Apr 3 19:46:18 UTC 2018


On 04/03/2018 05:43 PM, Ken Gaillot wrote:
> On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote:
>> On 04/02/2018 04:02 PM, Ken Gaillot wrote:
>>> On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais
>>> wrote:
>>>> On Sun, 1 Apr 2018 09:01:15 +0300
>>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>>>>
>>>>> 31.03.2018 23:29, Jehan-Guillaume de Rorthais wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I experienced a problem in a two-node cluster. It has one FA per
>>>>>> node, and location constraints so that each FA avoids the node it
>>>>>> is supposed to fence. 
>>>>> If you mean a stonith resource - as far as I know, a location
>>>>> constraint does not affect stonith operations; it only changes
>>>>> where the monitoring action is performed.
>>>> Sure.
>>>>
>>>>> You can create two stonith resources and declare that each
>>>>> can fence only a single node, but that is not a location
>>>>> constraint, it is resource configuration. Showing your
>>>>> configuration would be helpful to avoid guessing.
>>>> True, I should have done that. A conf worth thousands of words :)
>>>>
>>>>   crm conf<<EOC
>>>>
>>>>   primitive fence_vm_srv1 stonith:fence_virsh                   \
>>>>     params pcmk_host_check="static-list" pcmk_host_list="srv1"  \
>>>>            ipaddr="192.168.2.1" login="<user>"                  \
>>>>            identity_file="/root/.ssh/id_rsa"                    \
>>>>            port="srv1-d8" action="off"                          \
>>>>     op monitor interval=10s
>>>>
>>>>   location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1
>>>>
>>>>   primitive fence_vm_srv2 stonith:fence_virsh                   \
>>>>     params pcmk_host_check="static-list" pcmk_host_list="srv2"  \
>>>>            ipaddr="192.168.2.1" login="<user>"                  \
>>>>            identity_file="/root/.ssh/id_rsa"                    \
>>>>            port="srv2-d8" action="off"                          \
>>>>     op monitor interval=10s
>>>>
>>>>   location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2
>>>>   
>>>>   EOC
>>>>
>> -inf constraints like that should effectively prevent
>> stonith-actions from being executed on those nodes.
> It shouldn't ...
>
> Pacemaker respects target-role=Started/Stopped for controlling
> execution of fence devices, but location (or even whether the device is
> "running" at all) only affects monitors, not execution.
>
>> Though there are a few issues with location constraints
>> and stonith-devices.
>>
>> When stonithd brings up the devices from the CIB, it
>> runs the parts of the pengine that fully evaluate these
>> constraints, and it would disable the stonith-device
>> if the resource is unrunnable on that node.
> That should be true only for target-role, not for everything that
> affects runnability.

cib_device_update bails out and removes the device if
- role == stopped
- node not in allowed_nodes-list of stonith-resource
- weight is negative

Wouldn't that include a -inf rule for a node?
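
As a quick way to check this in practice (an editorial sketch, not part of
the original thread; the option names are those of stonith_admin in
Pacemaker 1.1.x), one can ask the local stonithd which devices it has
registered and which it would use against a given target:

  # run on each node; a device dropped by cib_device_update should not
  # show up on the node it is banned from
  stonith_admin -L          # list devices registered with the local stonithd
  stonith_admin -l srv1     # list devices this node could use to fence srv1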

It is of course clear that no pengine-decision to start
a stonith-resource is required for it to be used for
fencing.

Regards,
Klaus

>
>> But this part is not retriggered for location constraints
>> with attributes or other content that would change
>> dynamically. So one has to stick with constraints as simple
>> and static as those in the example above.
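
As an illustration of the kind of constraint meant above (a hypothetical
crmsh snippet, not from the configuration earlier in the thread; the node
attribute "fencing_ok" is made up), a rule-based location constraint
depends on attribute values, and a later change of the attribute would not
retrigger the device update in stonithd:

  # hypothetical: bans the device wherever the node attribute is "no";
  # changing the attribute afterwards would not be picked up
  location fence_vm_srv1-by-attr fence_vm_srv1 \
    rule -inf: fencing_ok eq no

The plain, static -inf constraints shown earlier avoid that problem.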
>>
>> Regarding adding/removing location constraints dynamically,
>> I remember a bug that should have got fixed around 1.1.18
>> that led to improper handling, and actual usage, of
>> stonith-devices disabled on or banned from certain nodes.
>>
>> Regards,
>> Klaus
>>  
>>>>>> During some tests, an ms resource raised an error during the stop
>>>>>> action on both nodes. So both nodes were supposed to be fenced.
>>>>> In a two-node cluster you can set pcmk_delay_max so that both
>>>>> nodes do not attempt fencing simultaneously.
>>>> I'm not sure I understand the doc correctly with regard to this
>>>> property. Does pcmk_delay_max delay the request itself or the
>>>> execution of the request?
>>>>
>>>> In other words, is it:
>>>>
>>>>   delay -> fence query -> fencing action
>>>>
>>>> or 
>>>>
>>>>   fence query -> delay -> fence action
>>>>
>>>> ?
>>>>
>>>> The first definition would solve this issue, but not the second.
>>>> As I understand it, as soon as the fence query has been sent, the
>>>> node status is "UNCLEAN (online)".
>>> The latter -- you're correct, the node is already unclean by that
>>> time.
>>> Since the stop did not succeed, the node must be fenced to continue
>>> safely.
>> Well, pcmk_delay_base/max are made for the case
>> where both nodes in a 2-node cluster lose contact
>> and each sees the other as unclean.
>> If the loser gets fenced, its view of the partner
>> node becomes irrelevant.
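
For reference, a sketch of how pcmk_delay_max could be added to one of the
devices from the configuration earlier in the thread (the delay value is
arbitrary and the snippet is untested):

  # giving the devices different delays (here a random delay of up to 10s
  # on one of them) makes it unlikely that both nodes shoot each other at
  # the same moment in a fence race
  primitive fence_vm_srv2 stonith:fence_virsh                   \
    params pcmk_host_check="static-list" pcmk_host_list="srv2"  \
           ipaddr="192.168.2.1" login="<user>"                  \
           identity_file="/root/.ssh/id_rsa"                    \
           port="srv2-d8" action="off"                          \
           pcmk_delay_max="10s"                                 \
    op monitor interval=10s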
>>
>>>>>> The first node did, but no FA was then able to fence the second
>>>>>> one. So the node stayed DC and was reported as "UNCLEAN (online)".
>>>>>>
>>>>>> We were able to fix the original resource problem, but not to
>>>>>> avoid the useless fencing of the second node.
>>>>>>
>>>>>> My questions are:
>>>>>>
>>>>>> 1. is it possible to cancel the fencing request?
>>>>>> 2. is it possible to reset the node status to "online"?
>>>>> Not that I'm aware of.
>>>> Argh!
>>>>
>>>> ++
>>> You could fix the problem with the stopped service manually, then
>>> run "stonith_admin --confirm=<NODENAME>" (or higher-level tool
>>> equivalent). That tells the cluster that you took care of the issue
>>> yourself, so fencing can be considered complete.
>>>
>>> The catch there is that the cluster will assume you stopped the
>>> node, and all services on it are stopped. That could potentially
>>> cause some headaches if it's not true. I'm guessing that if you
>>> unmanaged all the resources on it first, then confirmed fencing,
>>> the cluster would detect everything properly, then you could
>>> re-manage.
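
One possible sequence for the workaround described above (illustrative
only; "ms_pgsql" is a placeholder for the failed ms resource, "srv2" for
the unclean node, and the commands assume crmsh and stonith_admin):

  crm resource unmanage ms_pgsql     # repeat for every resource running on srv2
  # ... fix whatever made the stop action fail ...
  stonith_admin --confirm=srv2       # tell the cluster the fencing of srv2 is done
  crm resource cleanup ms_pgsql      # clear the failed stop from the status section
  crm resource manage ms_pgsql       # hand control back to the cluster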
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.
>> pdf
>> Bugs: http://bugs.clusterlabs.org



