<div dir="ltr">Hi,<div><br></div><div>There are two issues raised in your email.</div><div><br></div><div>1) power-wait</div><div><br></div><div>power-wait is a fairly advanced option, and there are only a few fence devices/agents where it makes sense, and then only because the HW/firmware on the device is somewhat broken. Basically, after we execute a power ON/OFF operation, we wait power-wait seconds before sending the next command. I don't recall any issues of this kind with APC devices.</div><div><br></div><div><br></div><div>2) <span style="font-size:13px">the only theory I could come up with was that maybe the fencing operation was considered complete too quickly? </span></div><div><span style="font-size:13px"><br></span></div><div><span style="font-size:13px">That is virtually impossible. Even when the power ON/OFF operation is asynchronous, we check the status of the device, and the fence agent waits until the status of the plug/VM/... matches what the user requested. </span></div><div><span style="font-size:13px"><br></span></div><div><br></div><div><span style="font-size:13px">m,</span></div><div><span style="font-size:13px"><br></span></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Sep 2, 2016 at 3:14 PM, Dan Swartzendruber <span dir="ltr"><<a href="mailto:dswartz@druber.com" target="_blank">dswartz@druber.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
So, I was testing my ZFS dual-head JBOD 2-node cluster. Manual failovers worked just fine. I then went to try an acid test by logging in to node A and doing 'systemctl stop network'. Sure enough, pacemaker told the APC fencing agent to power-cycle node A. The ZFS pool moved to node B as expected. As soon as node A was back up, I migrated the pool/IP back to node A. I *thought* all was okay, until a bit later, when I did 'zpool status' and saw checksum errors on both sides of several of the vdevs. After much digging and poking, the only theory I could come up with was that maybe the fencing operation was considered complete too quickly? I googled for examples, and the best tutorial I found used power-wait=5, whereas the default seems to be power-wait=0? (This is CentOS 7, btw...) I changed it to use 5 instead of 0 and did several fencing operations while a guest VM (vSphere via NFS) was writing to the pool. So far, no evidence of corruption. BTW, the way I was creating and managing the cluster was with the lcmc java gui. Possibly the power-wait default of 0 comes from there; I can't really tell. Any thoughts or ideas appreciated :)<br>
<br>
_______________________________________________<br>
Users mailing list: <a href="mailto:Users@clusterlabs.org" target="_blank">Users@clusterlabs.org</a><br>
<a href="http://clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://clusterlabs.org/mailman/listinfo/users</a><br>
<br>
Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>
Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>
Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>
</blockquote></div><br></div>
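For what it's worth, the behaviour described above (send the power command, pause for power-wait, then poll the device until its reported status matches the requested state) can be sketched roughly like this. This is only an illustration: `FakePlug`, `send_command`, and `status` are made-up names for this sketch, not the real fence-agents API.

```python
import time

POWER_WAIT = 0.1     # pause after each power command (the power-wait option; normally whole seconds)
POLL_INTERVAL = 0.05  # how often to re-check device status


class FakePlug:
    """Simulated switched outlet whose state changes a moment after the command (asynchronous)."""

    def __init__(self):
        self.state = "on"
        self._pending = None
        self._ready_at = 0.0

    def send_command(self, cmd):
        # The command takes effect a short time later, like real hardware.
        self._pending = cmd
        self._ready_at = time.monotonic() + 0.1

    def status(self):
        if self._pending and time.monotonic() >= self._ready_at:
            self.state, self._pending = self._pending, None
        return self.state


def set_power(device, state, timeout=5.0):
    """Issue a power command, honour power-wait, then poll until status matches."""
    device.send_command(state)
    time.sleep(POWER_WAIT)  # power-wait: settle time for quirky firmware
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if device.status() == state:  # fencing completes only once status confirms the new state
            return True
        time.sleep(POLL_INTERVAL)
    return False  # caller treats this as a failed fencing operation
```

The key point is the polling loop at the end: even with power-wait=0, the operation is not reported successful until the device itself reports the requested state.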