[ClusterLabs] Antw: Re: fence_apc delay?

Mon Sep 5 07:04:53 UTC 2016

>>> Marek Grac <mgrac at redhat.com> wrote on 03.09.2016 at 14:41 in message
<CA+40=JWs_6hjgLaJCSZAqa6o9RqH79OA9aQ150z1+5Kjst_niQ at mail.gmail.com>:
> Hi,
> 
> There are two problems mentioned in the email.
> 
> 1) power-wait
> 
> Power-wait is a fairly advanced option, and there are only a few fence
> devices/agents where it makes sense, and only because the HW/firmware on the
> device is somewhat broken. Basically, when we execute a power ON/OFF
> operation, we wait for power-wait seconds before we send the next command. I
> don't remember any issues of this kind with APC.
> 
> 
> 2) the only theory I could come up with was that maybe the fencing
> operation was considered complete too quickly?
> 
> That is virtually impossible. Even when power ON/OFF is asynchronous, we
> test the status of the device, and the fence agent waits until the status of
> the plug/VM/... matches what the user wants.
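
The behaviour described above (send the power command, pause for power-wait seconds, then poll the status until it matches the request) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual fence_apc code: fence_off, set_power, get_status, timeout and poll are hypothetical names, and the real agent's timeouts and retry logic may differ.

```python
import time

def fence_off(set_power, get_status, power_wait=0.0, timeout=20.0, poll=1.0,
              sleep=time.sleep, clock=time.monotonic):
    """Switch a plug off, then block until the device reports it as off.

    set_power and get_status are hypothetical callables standing in for
    whatever talks to the power switch; they are not a real library API.
    """
    set_power("off")      # may return before the plug has really changed state
    sleep(power_wait)     # extra pause for devices with slow or odd firmware
    deadline = clock() + timeout
    while clock() < deadline:
        if get_status() == "off":
            return True   # only a confirmed status counts as success
        sleep(poll)
    return False          # status never matched: treat the fencing as failed
```

With power_wait=0 the polling loop alone still prevents success from being reported before the switch confirms the plug is off.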

I can imagine that a powerful power supply can keep delivering power for up to one second after the mains is disconnected. If the cluster reacts very quickly after fencing, that might be a problem. I'd suggest a 5 to 10 second delay between the fencing action and the cluster's reaction.
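
As a rough sketch of that idea, assuming hypothetical fence and recover callables (this is not how Pacemaker is configured, only an illustration of the ordering):

```python
import time

def fence_then_recover(fence, recover, settle_seconds=5.0, sleep=time.sleep):
    """Run recovery only after a confirmed fence plus a settling delay.

    fence and recover are hypothetical callables; settle_seconds models the
    suggested 5 to 10 second pause that lets residual power drain away.
    """
    if not fence():
        return False       # fencing not confirmed: do not touch shared storage
    sleep(settle_seconds)  # give the fenced node's power supply time to drain
    recover()              # e.g. import the pool and bring up the IP elsewhere
    return True
```

The point is only the ordering: recovery never starts until fencing has been confirmed and the delay has elapsed.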

> 
> 
> m,
> 
> 
> On Fri, Sep 2, 2016 at 3:14 PM, Dan Swartzendruber <dswartz at druber.com>
> wrote:
> 
>>
>> So, I was testing my ZFS dual-head JBOD 2-node cluster.  Manual failovers
>> worked just fine.  I then went to try an acid-test by logging in to node A
>> and doing 'systemctl stop network'.  Sure enough, pacemaker told the APC
>> fencing agent to power-cycle node A.  The ZFS pool moved to node B as
>> expected.  As soon as node A was back up, I migrated the pool/IP back to
>> node A.  I *thought* all was okay, until a bit later, I did 'zpool status',
>> and saw checksum errors on both sides of several of the vdevs.  After much
>> digging and poking, the only theory I could come up with was that maybe the
>> fencing operation was considered complete too quickly?  I googled for
>> examples using this, and the best tutorial I found showed using a
>> power-wait=5, whereas the default seems to be power-wait=0?  (this is
>> CentOS 7, btw...)  I changed it to use 5 instead of 0, and did several
>> fencing operations while a guest VM (vsphere via NFS) was writing to the
>> pool.  So far, no evidence of corruption.  BTW, the way I was creating and
>> managing the cluster was with the lcmc java gui.  Possibly the power-wait
>> default of 0 comes from there, I can't really tell.  Any thoughts or ideas
>> appreciated :)