[ClusterLabs] How is fencing and unfencing supposed to work?

Fri Sep 28 13:11:38 EDT 2018

On 2018-09-04 8:49 p.m., Ken Gaillot wrote:
> On Tue, 2018-08-21 at 10:23 -0500, Ryan Thomas wrote:
>> I’m seeing unexpected behavior when using “unfencing” – I don’t think
>> I’m understanding it correctly.  I configured a resource that
>> “requires unfencing” and have a custom fencing agent which “provides
>> unfencing”.  I perform a simple test where I set up the cluster and
>> then run “pcs stonith fence node2”, and I see that node2 is
>> successfully fenced by sending an “off” action to my fencing agent.
>> But, immediately after this, I see an “on” action sent to my fencing
>> agent.  My fence agent doesn’t implement the “reboot” action, so
>> perhaps it’s trying to reboot by running an off action followed by
>> an on action.  Prior to adding “provides unfencing” to the fencing
>> agent, I didn’t see the on action.  It seems unsafe to say “node2,
>> you can’t run” and then immediately “you can run”.
> I'm not as familiar with unfencing as I'd like, but I believe the basic
> idea is:
>
> - the fence agent's off action cuts the machine off from something
> essential needed to run resources (generally shared storage or network
> access)
>
> - the fencing works such that a fenced host is not able to request
> rejoining the cluster without manual intervention by a sysadmin
>
> - when the sysadmin allows the host back into the cluster, and it
> contacts the other nodes to rejoin, the cluster will call the fence
> agent's on action, which is expected to re-enable the host's access
>
> How that works in practice, I have only vague knowledge.

This is correct. Consider fabric fencing, where Fibre Channel ports are
disconnected: unfencing restores the connection. The same applies to a
pure 'off' fence call to switched PDUs, as you mention above: unfencing
powers the outlets back up.
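
To make the off/on pairing concrete, here is a minimal sketch of the
agent side. It is not the poster's fence_custom and not a complete
agent. It assumes the agent gets its options as key=value lines on
stdin, with the target in a "nodename" key, and that "status" exits 0
when access is enabled and 2 when it is cut off. The function name
fence_sketch and the STATE_DIR marker files are hypothetical stand-ins
for whatever really isolates the node (switch port, PDU outlet).

```shell
#!/bin/sh
# Hypothetical sketch, not a production fence agent.
# STATE_DIR stands in for the real isolation mechanism; a marker file
# means "this node is currently cut off".
STATE_DIR="${STATE_DIR:-/tmp/fence_sketch}"

fence_sketch() {
    action=""
    target=""
    # Assumption: options arrive as key=value lines on stdin.
    while IFS='=' read -r key value; do
        case "$key" in
            action)   action="$value" ;;
            nodename) target="$value" ;;
        esac
    done
    if [ -z "$action" ] || [ -z "$target" ]; then
        echo "need action= and nodename=" >&2
        return 1
    fi
    marker="$STATE_DIR/$target.fenced"
    case "$action" in
        off)    # fence: revoke the node's access
            mkdir -p "$STATE_DIR" && : > "$marker" ;;
        on)     # unfence: restore the node's access
            rm -f "$marker" ;;
        status) # 0 = access enabled, 2 = access cut off
            if [ -e "$marker" ]; then return 2; fi
            return 0 ;;
        *)
            echo "unsupported action: $action" >&2
            return 1 ;;
    esac
}
```

For example, printf 'action=off\nnodename=node2\n' | fence_sketch
creates the marker for node2, and the same call with action=on removes
it. A real agent would also have to answer "metadata" and "monitor",
which this sketch leaves out.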

>> I don’t think I’m understanding this aspect of fencing/stonith.  I
>> thought that the fence agent acted as a proxy to a node; when the
>> node was fenced, it was isolated from shared storage by some means
>> (power, fabric, etc).  It seems like it shouldn’t become unfenced
>> until connectivity between the nodes is repaired.  Yet, the node is
>> turned “off” (isolated) and then “on” (unisolated) immediately.  This
>> (kind of) makes sense for a fencing agent that uses power to isolate,
>> since when it’s turned back on, pacemaker will not start any
>> resources on that node until it sees the other nodes (due to the
>> wait_for_all setting).  However, for other types of fencing agents,
>> it doesn’t make sense.  Does the “off” action not mean isolate from
>> shared storage? And the “on” action not mean unisolate?  What is the
>> correct way to understand fencing/stonith?
> I think the key idea is that "on" will be called when the fenced node
> asks to rejoin the cluster. So stopping that from happening until a
> sysadmin has intervened is an important part (if I'm not missing
> something).
>
> Note that if the fenced node still has network connectivity to the
> cluster, and the fenced node is actually operational, it will be
> notified by the cluster that it was fenced, and it will stop its
> pacemaker, thus fulfilling the requirement. But you obviously can't
> rely on that because fencing may be called precisely because network
> connectivity is lost or the host is not fully operational.
>
>> The behavior I wanted to see was, when pacemaker lost connectivity to
>> a node, it would run the off action for that node.  If this
>> succeeded, it could continue running resources.  Later, when
>> pacemaker saw the node again it would run the “on” action on the
>> fence agent (knowing that it was no longer split-brained).  Node2
>> would try to do the same thing, but once it was fenced, it would no
>> longer attempt to fence node1.  It also wouldn’t attempt to start any
>> resources.  I thought that adding “requires unfencing” to the
>> resource would make this happen.  Is there a way to get this
>> behavior?
> That is basically what happens; the question is how "pacemaker saw the
> node again" becomes possible.
>
>> Thanks!
>>
>> btw, here's the cluster configuration:
>>
>> pcs cluster auth node1 node2
>> pcs cluster setup --name ataCluster node1 node2
>> pcs cluster start --all
>> pcs property set stonith-enabled=true
>> pcs resource defaults migration-threshold=1
>> pcs resource create Jaws ocf:atavium:myResource op stop on-fail=fence meta requires=unfencing
>> pcs stonith create myStonith fence_custom op monitor interval=0 meta provides=unfencing
>> pcs property set symmetric-cluster=true