[ClusterLabs] fence_apc delay?

Fri Sep 2 13:14:30 UTC 2016

So, I was testing my ZFS dual-head JBOD 2-node cluster.  Manual 
failovers worked just fine.  I then tried an acid test: logging in to 
node A and running 'systemctl stop network'.  Sure enough, Pacemaker 
told the APC fencing agent to power-cycle node A.  The ZFS pool moved to 
node B as expected.  As soon as node A was back up, I migrated the 
pool/IP back to node A.  I *thought* all was okay until, a bit later, I 
ran 'zpool status' and saw checksum errors on both sides of several of 
the vdevs.  After much digging and poking, the only theory I could come 
up with is that the fencing operation was considered complete too 
quickly, so node B may have imported the pool before node A was actually 
powered off.  I googled for examples using fence_apc, and the best tutorial I 
found used power-wait=5, whereas the default seems to be 
power-wait=0?  (This is CentOS 7, btw...)  I changed it to 5 instead 
of 0, and ran several fencing operations while a guest VM (vSphere via 
NFS) was writing to the pool.  So far, no evidence of corruption.  BTW, 
I was creating and managing the cluster with the LCMC Java 
GUI.  Possibly the power-wait default of 0 comes from there; I can't 
really tell.  Any thoughts or ideas appreciated :)
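
For reference, I made the change through LCMC, but a rough command-line 
equivalent might look like the sketch below.  It assumes pcs is installed 
and that the stonith resource is named fence_apc_nodeA (a made-up name; 
substitute your own).  It also assumes the agent metadata spells the 
option power_wait rather than power-wait, which is worth confirming in 
the 'pcs stonith describe' output before relying on it.

```shell
# Hypothetical stonith resource name; replace with the real one.
STONITH_ID="fence_apc_nodeA"
# Seconds to wait after issuing the power off/on command.
POWER_WAIT=5

if command -v pcs >/dev/null 2>&1; then
    # Show what the agent itself documents for this option (name, default).
    pcs stonith describe fence_apc | grep -i -A2 'power_wait'
    # Set the value on the existing stonith resource.
    pcs stonith update "$STONITH_ID" power_wait="$POWER_WAIT"
else
    # No pcs on this machine: just print the command that would be run.
    echo "pcs not found; would run: pcs stonith update $STONITH_ID power_wait=$POWER_WAIT"
fi
```

Comparing the default reported by the agent with what the cluster 
configuration actually contains should at least show whether the 0 came 
from the agent or from the GUI.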