[ClusterLabs] fence_apc delay?

Fri Sep 2 14:16:14 UTC 2016

On 2016-09-02 10:09, Ken Gaillot wrote:
> On 09/02/2016 08:14 AM, Dan Swartzendruber wrote:
>> 
>> So, I was testing my ZFS dual-head JBOD 2-node cluster.  Manual
>> failovers worked just fine.  I then went to try an acid-test by logging
>> in to node A and doing 'systemctl stop network'.  Sure enough, pacemaker
>> told the APC fencing agent to power-cycle node A.  The ZFS pool moved to
>> node B as expected.  As soon as node A was back up, I migrated the
>> pool/IP back to node A.  I *thought* all was okay, until a bit later, I
>> did 'zpool status', and saw checksum errors on both sides of several of
>> the vdevs.  After much digging and poking, the only theory I could come
>> up with was that maybe the fencing operation was considered complete too
>> quickly?  I googled for examples using this, and the best tutorial I
>> found showed using a power-wait=5, whereas the default seems to be
>> power-wait=0?  (this is CentOS 7, btw...)  I changed it to use 5 instead
> 
> That's a reasonable theory -- that's why power_wait is available. It
> would be nice if there were a page collecting users' experience with the
> ideal power_wait for various devices. Even better if fence-agents used
> those values as the defaults.

Ken, thanks.  FWIW, this is a Dell PowerEdge R905.  I have no idea how 
long the power supplies in that thing can keep things going when AC 
power goes away.  Always wary of small sample sizes, but I got filesystem 
corruption after 1 fencing event with power_wait=0, and none after 3 
fencing events with power_wait=5.
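
For anyone who finds this thread later, here is a minimal sketch of the change described above.  It assumes the fence device is managed with pcs (as on CentOS 7) and that the stonith resource is named "fence_apc_a"; that name is purely hypothetical, so substitute whatever 'pcs stonith show' lists on your cluster.  The script only prints the command so it can be reviewed before running it on a node.

```shell
#!/bin/sh
# Sketch only: "fence_apc_a" is a hypothetical stonith resource id,
# not a name taken from this thread.
STONITH_ID="${STONITH_ID:-fence_apc_a}"
# Seconds the agent waits after issuing the power off/on command.
POWER_WAIT="${POWER_WAIT:-5}"

# Build the pcs command that changes only the power_wait option
# of an existing stonith resource.
power_wait_cmd() {
    printf 'pcs stonith update %s power_wait=%s\n' "$1" "$2"
}

# Print the command for review instead of executing it.
power_wait_cmd "$STONITH_ID" "$POWER_WAIT"
```

The value 5 is just what worked in the three test fences mentioned above; other hardware may well need a different delay.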