<div dir="ltr">Update: It seems fencing does work as I expected. The problem was with how I was testing it. I was seeing the <font face="Calibri, sans-serif"><span style="font-size:14.6667px">node turned “off” (isolated) and then “on” (unisolated) immediately, which seemed wrong. This was because I was simulating a node failure in my testing by killing some processes, including the pacemaker and corosync processes. However, the systemd unit file for pacemaker/corosync is configured to restart the service immediately if it dies. So I was seeing the "on" call immediately after the "off" because the pacemaker/corosync service was restarted, which made it appear that the node I had just killed immediately came back.</span></font><div><font face="Calibri, sans-serif"><span style="font-size:14.6667px">Thanks,</span></font></div><div><font face="Calibri, sans-serif"><span style="font-size:14.6667px">Ryan</span></font></div></div><br><div class="gmail_quote"><div dir="ltr">On Tue, Sep 4, 2018 at 7:49 PM Ken Gaillot <<a href="mailto:kgaillot@redhat.com">kgaillot@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Tue, 2018-08-21 at 10:23 -0500, Ryan Thomas wrote:<br>

> I’m seeing unexpected behavior when using “unfencing” – I don’t think<br>

> I’m understanding it correctly.  I configured a resource that<br>

> “requires unfencing” and have a custom fencing agent which “provides<br>

> unfencing”.  I perform a simple test where I set up the cluster and<br>

> then run “pcs stonith fence node2”, and I see that node2 is<br>

> successfully fenced by sending an “off” action to my fencing agent. <br>

> But, immediately after this, I see an “on” action sent to my fencing<br>

> agent.  My fence agent doesn’t implement the “reboot” action, so<br>

> perhaps it’s trying to reboot by running an off action followed by an<br>

> on action.  Prior to adding “provides unfencing” to the fencing<br>

> agent, I didn’t see the on action. It seems unsafe to say “node2 you<br>

> can’t run” and then immediately “ you can run”.<br>

<br>

I'm not as familiar with unfencing as I'd like, but I believe the basic<br>

idea is:<br>

<br>

- the fence agent's off action cuts the machine off from something<br>

essential needed to run resources (generally shared storage or network<br>

access)<br>

<br>

- the fencing works such that a fenced host is not able to request<br>

rejoining the cluster without manual intervention by a sysadmin<br>

<br>

- when the sysadmin allows the host back into the cluster, and it<br>

contacts the other nodes to rejoin, the cluster will call the fence<br>

agent's on action, which is expected to re-enable the host's access<br>

<br>

How that works in practice, I have only vague knowledge.<br>

<br>

> I don’t think I’m understanding this aspect of fencing/stonith.  I<br>

> thought that the fence agent acted as a proxy to a node, when the<br>

> node was fenced, it was isolated from shared storage by some means<br>

> (power, fabric, etc).  It seems like it shouldn’t become unfenced<br>

> until connectivity between the nodes is repaired.  Yet, the node is<br>

> turned “off” (isolated) and then “on” (unisolated) immediately.  This<br>

> (kind-of) makes sense for a fencing agent that uses power to isolate,<br>

> since when it’s turned back on, pacemaker will not start any<br>

> resources on that node until it sees the other nodes (due to the<br>

> wait_for_all setting).  However, for other types of fencing agents,<br>

> it doesn’t make sense.  Does the “off” action not mean isolate from<br>

> shared storage? And the “on” action not mean unisolate?  What is the<br>

> correct way to understand fencing/stonith?<br>

<br>

I think the key idea is that "on" will be called when the fenced node<br>

asks to rejoin the cluster. So stopping that from happening until a<br>

sysadmin has intervened is an important part (if I'm not missing<br>

something).<br>

<br>

Note that if the fenced node still has network connectivity to the<br>

cluster, and the fenced node is actually operational, it will be<br>

notified by the cluster that it was fenced, and it will stop its<br>

pacemaker, thus fulfilling the requirement. But you obviously can't<br>

rely on that because fencing may be called precisely because network<br>

connectivity is lost or the host is not fully operational.<br>

<br>

> The behavior I wanted to see was, when pacemaker lost connectivity to<br>

> a node, it would run the off action for that node.  If this<br>

> succeeded, it could continue running resources.  Later, when<br>

> pacemaker saw the node again it would run the “on” action on the<br>

> fence agent (knowing that it was no longer split-brained).  Node2,<br>

> would try to do the same thing, but once it was fenced, it would no<br>

> longer attempt to fence node1.  It also wouldn’t attempt to start any<br>

> resources.  I thought that adding “requires unfencing” to the<br>

> resource would make this happen.  Is there a way to get this<br>

> behavior?<br>

<br>

That is basically what happens; the question is how "pacemaker saw the<br>

node again" becomes possible.<br>

<br>

> <br>

> Thanks! <br>

> <br>

> btw, here's the cluster configuration:<br>

> <br>

> pcs cluster auth node1 node2<br>

> pcs cluster setup --name ataCluster node1 node2<br>

> pcs cluster start --all<br>

> pcs property set stonith-enabled=true<br>

> pcs resource defaults migration-threshold=1<br>

> pcs resource create Jaws ocf:atavium:myResource op stop on-fail=fence <br>

> meta requires=unfencing<br>

> pcs stonith create myStonith fence_custom op monitor interval=0 meta<br>

> provides=unfencing<br>

> pcs property set symmetric-cluster=true<br>

-- <br>

Ken Gaillot <<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>><br>


</blockquote></div>
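<div dir="ltr">For anyone who hits the same testing pitfall described in the update at the top: a minimal sketch for checking, before killing daemons to simulate a failure, whether systemd will simply bring them back. It assumes the units are named pacemaker.service and corosync.service, which may differ on your distribution. The helper names restarts_after_kill and check_units are made up for illustration, and the list of Restart= values reflects my reading of the systemd.service documentation rather than anything specific to Pacemaker.</div>

```shell
#!/bin/sh
# Sketch only: unit names and helper names are assumptions, not part of pcs or Pacemaker.

# Succeeds (exit status 0) if the given Restart= value makes systemd restart a
# service whose main process died from an unclean signal such as SIGKILL.
restarts_after_kill() {
    case "$1" in
        always|on-failure|on-abnormal|on-abort) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints one line per unit describing what a kill-based test would do.
check_units() {
    if ! command -v systemctl >/dev/null 2>&1; then
        echo "systemctl not found; nothing to check"
        return 0
    fi
    for unit in "$@"; do
        # Output looks like "Restart=on-failure"; strip the key to get the value.
        line=$(systemctl show -p Restart "$unit" 2>/dev/null) || line=""
        policy=${line#Restart=}
        if restarts_after_kill "$policy"; then
            echo "$unit: Restart=$policy, a killed daemon is restarted by systemd"
        else
            echo "$unit: Restart=${policy:-unknown}, a killed daemon stays down"
        fi
    done
}

check_units pacemaker.service corosync.service
```

<div dir="ltr">If either unit reports a restarting policy, killing the processes will look like the node rejoining right after it was fenced, which is what produced the immediate "on" call in my test.</div>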