[ClusterLabs] fence_apc delay?
Dan Swartzendruber
dswartz at druber.com
Fri Sep 2 13:14:30 UTC 2016
So, I was testing my ZFS dual-head JBOD 2-node cluster. Manual
failovers worked just fine. I then went to try an acid-test by logging
in to node A and doing 'systemctl stop network'. Sure enough, pacemaker
told the APC fencing agent to power-cycle node A. The ZFS pool moved to
node B as expected. As soon as node A was back up, I migrated the
pool/IP back to node A. I *thought* all was okay, until a bit later, I
did 'zpool status', and saw checksum errors on both sides of several of
the vdevs. After much digging and poking, the only theory I could come
up with was that maybe the fencing operation was considered complete too
quickly, so node B imported the pool while node A could still write to
it? I googled for examples using fence_apc, and the best tutorial I
found showed using a power-wait=5, whereas the default seems to be
power-wait=0? (this is CentOS 7, btw...) I changed it to use 5 instead
of 0, and did several fencing operations while a guest VM (vSphere via
NFS) was writing to the pool. So far, no evidence of corruption. BTW,
the way I was creating and managing the cluster was with the lcmc java
gui. Possibly the power-wait default of 0 comes from there, I can't
really tell. Any thoughts or ideas appreciated :)
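
For anyone making the same change from the command line rather than
through lcmc, here is a rough sketch of how it might look with pcs. The
resource id fence_apc_nodeA is made up for illustration, and I am
assuming the pcs attribute is spelled power_wait (underscore), with the
hyphenated form being the agent's command-line option. The script only
prints the command unless APPLY=1 is set and pcs is installed.

```shell
#!/bin/sh
# Hypothetical stonith resource id; substitute the one from your own cluster.
STONITH_ID="${STONITH_ID:-fence_apc_nodeA}"
# Seconds the agent should wait after issuing a power off/on.
POWER_WAIT="${POWER_WAIT:-5}"

# Build the pcs command that sets power_wait on the fence device.
power_wait_cmd() {
    printf 'pcs stonith update %s power_wait=%s\n' "$1" "$2"
}

CMD=$(power_wait_cmd "$STONITH_ID" "$POWER_WAIT")
echo "$CMD"

# Only run it when explicitly asked to and when pcs is actually available.
if [ "${APPLY:-0}" = "1" ] && command -v pcs >/dev/null 2>&1; then
    $CMD
fi
```

Running 'pcs stonith describe fence_apc' should list the parameters the
agent accepts on a given install, which is a safer way to confirm the
spelling and default than trusting this sketch.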