[ClusterLabs] Antw: Re: fence_apc delay?
Ulrich.Windl at rz.uni-regensburg.de
Mon Sep 5 03:04:53 EDT 2016
>>> Marek Grac <mgrac at redhat.com> wrote on 03.09.2016 at 14:41 in message
<CA+40=JWs_6hjgLaJCSZAqa6o9RqH79OA9aQ150z1+5Kjst_niQ at mail.gmail.com>:
> There are two problems mentioned in the email.
> 1) power-wait
> Power-wait is a fairly advanced option, and there are only a few fence
> devices/agents where it makes sense, usually because the HW/firmware on the
> device is somewhat broken. Basically, when we execute a power ON/OFF
> operation, we wait power-wait seconds before we send the next command. I
> don't remember any issue of this kind with APC.
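For reference, a minimal sketch of setting that option on a pcs-managed cluster. The resource name `apc-fence` is an assumption for illustration; power_wait is the standard fence-agent parameter described above:

```shell
# "apc-fence" is a hypothetical stonith resource name.
# power_wait is the number of seconds the agent sleeps after each
# power ON/OFF command before issuing the next command.
pcs stonith update apc-fence power_wait=5

# Inspect the resulting configuration (pcs on CentOS 7)
pcs stonith show apc-fence
```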
> 2) "the only theory I could come up with was that maybe the fencing
> operation was considered complete too quickly?"
> That is virtually impossible. Even when power ON/OFF is asynchronous, we
> poll the status of the device, and the fence agent waits until the status
> of the plug/VM/... matches what the user requested.
I can imagine that a powerful power supply can deliver up to a second of power even after mains is disconnected. If the cluster reacts very quickly after fencing, that could be a problem. I'd suggest a delay of 5 to 10 seconds between the fencing action and the cluster's reaction.
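One way to approximate such a delay is the agent's power-wait knob, which keeps the fencing operation from being reported complete immediately after the OFF command. A hedged sketch (resource name, host, and credentials are placeholders; check your agent's metadata for the exact parameter names on your version):

```shell
# Make the configured agent pause 10 s after each power command,
# so the cluster does not react the instant the OFF is sent.
# "apc-fence" is a hypothetical stonith resource name.
pcs stonith update apc-fence power_wait=10

# The same option when running fence_apc by hand for testing;
# host, credentials, and plug number are placeholders.
fence_apc --ip=apc.example.com --username=apc --password=secret \
          --plug=1 --action=reboot --power-wait=10
```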
> On Fri, Sep 2, 2016 at 3:14 PM, Dan Swartzendruber <dswartz at druber.com> wrote:
>> So, I was testing my ZFS dual-head JBOD 2-node cluster. Manual failovers
>> worked just fine. I then tried an acid test by logging in to node A and
>> doing 'systemctl stop network'. Sure enough, Pacemaker told the APC
>> fencing agent to power-cycle node A. The ZFS pool moved to node B as
>> expected. As soon as node A was back up, I migrated the pool/IP back to
>> node A. I *thought* all was okay, until a bit later, when I did 'zpool
>> status' and saw checksum errors on both sides of several of the vdevs.
>> After much digging and poking, the only theory I could come up with was
>> that maybe the fencing operation was considered complete too quickly? I
>> googled for examples, and the best tutorial I found used power-wait=5,
>> whereas the default seems to be power-wait=0 (this is CentOS 7, btw...).
>> I changed it from 0 to 5 and did several fencing operations while a guest
>> VM (vSphere via NFS) was writing to the pool. So far, no evidence of
>> corruption. BTW, I was creating and managing the cluster with the LCMC
>> Java GUI; possibly the power-wait default of 0 comes from there, I can't
>> really tell. Any thoughts or ideas appreciated :)
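The test cycle described above can be sketched roughly as follows. Node and pool names are placeholders, not the poster's actual configuration:

```shell
# From the surviving node, trigger a fencing operation against node A
stonith_admin --reboot node-a

# After the pool has failed over, scrub it and check for checksum
# errors in the CKSUM column ("tank" is a placeholder pool name)
zpool scrub tank
zpool status -v tank
```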
>> Users mailing list: Users at clusterlabs.org
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org