[ClusterLabs] How are fencing and unfencing supposed to work?

Ryan Thomas developmentrjt at gmail.com
Fri Sep 28 15:30:22 UTC 2018


Update:  It turns out that fencing does work the way I expected; the
problem was with how I was testing it.  I was seeing the node turned “off”
(isolated) and then “on” (unisolated) immediately, which seemed wrong.  This
was because the way I was turning the node off in my testing was to kill
some of its processes, including the pacemaker and corosync processes.
However, the systemd unit files for pacemaker/corosync are configured to
restart the services immediately if they die.  So I was seeing the “on”
call right after the “off” because the pacemaker/corosync services were
restarted, and the node I had just killed appeared to come back immediately.
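
For anyone who wants to reproduce this with a more realistic failure, the
sketch below shows how to check the units' restart policy and how to take a
node down without systemd immediately bringing the daemons back.  It assumes
the stock pacemaker/corosync units; the last command requires root and sysrq
enabled, and it crashes the node hard, so only use it on a test node.

# Show the restart policy built into the unit files
systemctl cat pacemaker | grep -i restart
systemctl cat corosync  | grep -i restart

# Instead of killing individual daemons (which systemd restarts),
# simulate a real node failure -- run ON the node under test:
echo c > /proc/sysrq-trigger    # immediate kernel crash, no clean shutdown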
Thanks,
Ryan

On Tue, Sep 4, 2018 at 7:49 PM Ken Gaillot <kgaillot at redhat.com> wrote:

> On Tue, 2018-08-21 at 10:23 -0500, Ryan Thomas wrote:
> > I’m seeing unexpected behavior when using “unfencing” – I don’t think
> > I’m understanding it correctly.  I configured a resource that
> > “requires unfencing” and have a custom fencing agent which “provides
> > unfencing”.   I perform a simple test where I set up the cluster and
> > then run “pcs stonith fence node2”, and I see that node2 is
> > successfully fenced by sending an “off” action to my fencing agent.
> > But, immediately after this, I see an “on” action sent to my fencing
> > agent.  My fence agent doesn’t implement the “reboot” action, so
> > perhaps it’s trying to reboot by running an off action followed by an
> > on action.  Prior to adding “provides unfencing” to the fencing
> > agent, I didn’t see the on action.  It seems unsafe to say “node2, you
> > can’t run” and then immediately “you can run”.
>
> I'm not as familiar with unfencing as I'd like, but I believe the basic
> idea is:
>
> - the fence agent's off action cuts the machine off from something
> essential needed to run resources (generally shared storage or network
> access)
>
> - the fencing works such that a fenced host is not able to request
> rejoining the cluster without manual intervention by a sysadmin
>
> - when the sysadmin allows the host back into the cluster, and it
> contacts the other nodes to rejoin, the cluster will call the fence
> agent's on action, which is expected to re-enable the host's access
>
> How that works in practice, I have only vague knowledge.
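
To make that off/on split concrete, here is a rough, purely illustrative
skeleton of a fabric-style fence agent.  It assumes the ClusterLabs fence
agent convention of parameters arriving as name=value lines on stdin, and
block_access/allow_access are placeholders for whatever actually cuts or
restores the node's access; treat it as a sketch, not a drop-in agent.

#!/bin/sh
# Illustrative skeleton of a fabric-style fence agent (not a real agent).
# The fencer passes parameters as name=value lines on stdin, e.g.
#   action=off
#   nodename=node2

# Placeholders for whatever blocks/unblocks the node's access to
# shared storage or the network fabric.
block_access() { echo "blocking access for $1"; }
allow_access() { echo "restoring access for $1"; }

action="" node=""
while read -r line; do
    case "$line" in
        action=*)                 action=${line#action=} ;;
        nodename=*|port=*|plug=*) node=${line#*=} ;;
    esac
done

case "$action" in
    off)            block_access "$node" ;;   # fence: cut the node off
    on)             allow_access "$node" ;;   # unfence: run when the node rejoins
    monitor|status) exit 0 ;;
    metadata)
        cat <<'EOF'
<?xml version="1.0" ?>
<resource-agent name="fence_custom" shortdesc="example fabric fence agent">
  <parameters/>
  <actions>
    <action name="off"/>
    <!-- on_target="1": "on" runs on the node being unfenced;
         automatic="1": the cluster may unfence automatically at join -->
    <action name="on" on_target="1" automatic="1"/>
    <action name="monitor"/>
    <action name="metadata"/>
  </actions>
</resource-agent>
EOF
        ;;
    *)              exit 1 ;;
esac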
>
> > I don’t think I’m understanding this aspect of fencing/stonith.  I
> > thought that the fence agent acted as a proxy to a node: when the
> > node was fenced, it was isolated from shared storage by some means
> > (power, fabric, etc).  It seems like it shouldn’t become unfenced
> > until connectivity between the nodes is repaired.  Yet, the node is
> > turned “off” (isolated) and then “on” (unisolated) immediately.  This
> > (kind-of) makes sense for a fencing agent that uses power to isolate,
> > since when it’s turned back on, pacemaker will not start any
> > resources on that node until it sees the other nodes (due to the
> > wait_for_all setting).  However, for other types of fencing agents,
> > it doesn’t make sense.  Does the “off” action not mean isolate from
> > shared storage? And the “on” action not mean unisolate?  What is the
> > correct way to understand fencing/stonith?
>
> I think the key idea is that "on" will be called when the fenced node
> asks to rejoin the cluster. So stopping that from happening until a
> sysadmin has intervened is an important part (if I'm not missing
> something).
>
> Note that if the fenced node still has network connectivity to the
> cluster, and the fenced node is actually operational, it will be
> notified by the cluster that it was fenced, and it will stop its
> pacemaker, thus fulfilling the requirement. But you obviously can't
> rely on that because fencing may be called precisely because network
> connectivity is lost or the host is not fully operational.
>
> > The behavior I wanted to see was this: when pacemaker lost connectivity
> > a node, it would run the off action for that node.  If this
> > succeeded, it could continue running resources.  Later, when
> > pacemaker saw the node again it would run the “on” action on the
> > fence agent (knowing that it was no longer split-brained).  Node2
> > would try to do the same thing, but once it was fenced, it would no
> > longer attempt to fence node1.  It also wouldn’t attempt to start any
> > resources.  I thought that adding “requires unfencing” to the
> > resource would make this happen.  Is there a way to get this
> > behavior?
>
> That is basically what happens; the question is how "pacemaker saw the
> node again" becomes possible.
>
> >
> > Thanks!
> >
> > btw, here's the cluster configuration:
> >
> > pcs cluster auth node1 node2
> > pcs cluster setup --name ataCluster node1 node2
> > pcs cluster start --all
> > pcs property set stonith-enabled=true
> > pcs resource defaults migration-threshold=1
> > pcs resource create Jaws ocf:atavium:myResource op stop on-fail=fence
> > meta requires=unfencing
> > pcs stonith create myStonith fence_custom op monitor interval=0 meta
> > provides=unfencing
> > pcs property set symmetric-cluster=true
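
To sanity-check a setup like the one quoted above, a few standard
pacemaker/pcs commands can help; exact pcs syntax varies a bit between
versions, and the device/agent names here are just the ones from the
configuration above.

pcs stonith show                                # confirm the myStonith device is configured
stonith_admin --metadata --agent fence_custom   # see which actions/flags the agent advertises
pcs stonith fence node2                         # fence node2 manually, as in the test above
stonith_admin --history node2                   # show what fencing the cluster recorded for node2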
> --
> Ken Gaillot <kgaillot at redhat.com>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>

