[ClusterLabs] fence_apc delay?

Fri Sep 2 14:09:13 UTC 2016

On 09/02/2016 08:14 AM, Dan Swartzendruber wrote:
> 
> So, I was testing my ZFS dual-head JBOD 2-node cluster.  Manual
> failovers worked just fine.  I then went to try an acid-test by logging
> in to node A and doing 'systemctl stop network'.  Sure enough, pacemaker
> told the APC fencing agent to power-cycle node A.  The ZFS pool moved to
> node B as expected.  As soon as node A was back up, I migrated the
> pool/IP back to node A.  I *thought* all was okay, until a bit later, I
> did 'zpool status', and saw checksum errors on both sides of several of
> the vdevs.  After much digging and poking, the only theory I could come
> up with was that maybe the fencing operation was considered complete too
> quickly?  I googled for examples using this, and the best tutorial I
> found showed using a power-wait=5, whereas the default seems to be
> power-wait=0?  (this is CentOS 7, btw...)  I changed it to use 5 instead

That's a reasonable theory -- that's why power_wait is available. It
would be nice if there were a page collecting users' experience with the
ideal power_wait for various devices. Even better if fence-agents used
those values as the defaults.
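
For anyone who manages the cluster with pcs rather than lcmc, here is a
rough sketch of how the same change could be made from the command line.
The resource name fence_apc_nodeA is hypothetical, and the use of pcs is
an assumption (the original setup was done through lcmc). The value 5 is
the one from the post, not a tested recommendation for any particular APC
model. The snippet only prints the command so it can be reviewed first.

```shell
# Assumption: the stonith resource is named fence_apc_nodeA (hypothetical).
STONITH_ID="fence_apc_nodeA"

# Seconds to wait after issuing the power command; 5 is the value
# mentioned in the original post.
POWER_WAIT=5

# Build the command as a string so it can be inspected before use.
CMD="pcs stonith update ${STONITH_ID} power_wait=${POWER_WAIT}"
echo "$CMD"

# To apply it, run the printed command on a cluster node where pcs
# is installed:
# $CMD
```

Running "pcs stonith describe fence_apc" on the node should list the
parameters the installed agent accepts, which is one way to confirm that
power_wait is supported and to see its default.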

> of 0, and did several fencing operations while a guest VM (vSphere via
> NFS) was writing to the pool.  So far, no evidence of corruption.  BTW,
> the way I was creating and managing the cluster was with the lcmc java
> gui.  Possibly the power-wait default of 0 comes from there, I can't
> really tell.  Any thoughts or ideas appreciated :)