[ClusterLabs] How are fencing and unfencing supposed to work?
Ryan Thomas
developmentrjt at gmail.com
Tue Aug 21 11:23:19 EDT 2018
I’m seeing unexpected behavior when using “unfencing” – I don’t think I’m
understanding it correctly. I configured a resource that “requires
unfencing” and have a custom fencing agent which “provides unfencing”. I
perform a simple test where I set up the cluster and then run “pcs stonith
fence node2”, and I see that node2 is successfully fenced by sending an
“off” action to my fencing agent. But immediately after this, I see an
“on” action sent to my fencing agent. My fence agent doesn’t implement the
“reboot” action, so perhaps it’s trying to reboot by running an off action
followed by an on action. Prior to adding “provides unfencing” to the
fencing agent, I didn’t see the on action. It seems unsafe to say “node2,
you can’t run” and then immediately “you can run”.
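
For reference, a minimal sketch of the shape of such an agent (this is not my
actual fence_custom; it just shows the stdin key=value convention fence agents
use, with echo lines standing in for the real isolate/unisolate commands):

#!/bin/sh
# Simplified fence agent sketch: stonithd passes key=value options on stdin.
action=""
target=""
while read -r line; do
    case "$line" in
        action=*)           action="${line#action=}" ;;
        port=*|nodename=*)  target="${line#*=}" ;;
    esac
done

case "$action" in
    off) echo "isolating $target from shared storage" ;;   # fence
    on)  echo "restoring $target access" ;;                # unfence
    monitor|status) exit 0 ;;
    metadata)
        cat <<'EOF'
<?xml version="1.0" ?>
<resource-agent name="fence_custom" shortdesc="example fabric fence agent">
  <parameters/>
  <actions>
    <action name="on"/>
    <action name="off"/>
    <action name="monitor"/>
    <action name="metadata"/>
  </actions>
</resource-agent>
EOF
        ;;
    *) exit 1 ;;   # no reboot action implemented
esac
exit 0
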
I don’t think I’m understanding this aspect of fencing/stonith. I thought
that the fence agent acted as a proxy to a node: when the node was fenced,
it was isolated from shared storage by some means (power, fabric, etc.). It
seems like it shouldn’t become unfenced until connectivity between the
nodes is repaired. Yet the node is turned “off” (isolated) and then “on”
(unisolated) immediately. This (kind of) makes sense for a fencing agent
that uses power to isolate, since when it’s turned back on, pacemaker will
not start any resources on that node until it sees the other nodes (due
to the wait_for_all setting). However, for other types of fencing agents,
it doesn’t make sense. Does the “off” action not mean isolate from shared
storage? And does the “on” action not mean unisolate? What is the correct
way to understand fencing/stonith?
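
To make the question concrete: for a fabric-style agent I would expect “off”
and “on” to mean something like this (purely illustrative; the address and
iSCSI port are made up, and this is not my agent):

# "off" = isolate node2 from shared storage (run on the storage host):
iptables -I INPUT -s 192.0.2.2 -p tcp --dport 3260 -j DROP    # block node2's iSCSI traffic
# "on" = unisolate (unfence) node2 again:
iptables -D INPUT -s 192.0.2.2 -p tcp --dport 3260 -j DROP    # remove the block
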
The behavior I wanted to see was: when pacemaker lost connectivity to a
node, it would run the off action for that node. If this succeeded, it
could continue running resources. Later, when pacemaker saw the node again,
it would run the “on” action on the fence agent (knowing that it was no
longer split-brained). Node2 would try to do the same thing, but once it
was fenced, it would no longer attempt to fence node1. It also wouldn’t
attempt to start any resources. I thought that adding “requires unfencing”
to the resource would make this happen. Is there a way to get this
behavior?
Thanks!
btw, here's the cluster configuration:
- pcs cluster auth node1 node2
- pcs cluster setup --name ataCluster node1 node2
- pcs cluster start --all
- pcs property set stonith-enabled=true
- pcs resource defaults migration-threshold=1
- pcs resource create Jaws ocf:atavium:myResource op stop on-fail=fence meta requires=unfencing
- pcs stonith create myStonith fence_custom op monitor interval=0 meta provides=unfencing
- pcs property set symmetric-cluster=true
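
The test itself was just (the comments restate what I described above):

pcs stonith fence node2    # fence_custom is called with action=off, then immediately action=on
pcs status                 # to check the cluster state afterwards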