[ClusterLabs] Automatic Recovery for stonith:external/libvirt
mr at inwx.de
Tue Jan 12 07:06:27 EST 2016
Thanks for the reply. After further unsuccessful testing of the automatic
recovery I read an article on the topic. It recommends monitoring the
stonith resource only once every few hours.
I am happy with that, so I configured the monitoring interval to
9600 secs (about 2 hours 40 minutes).
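For reference, a minimal sketch of how that resource definition might look
in the crm shell, reusing the hostlist and hypervisor_uri from the quoted
mail below (the timeout value is only an assumption on my side):

crm configure primitive p_fence_ha3 stonith:external/libvirt \
    params hostlist="ha3" hypervisor_uri="qemu+tls://debian1/system" \
    op monitor interval="9600s" timeout="60s"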
On 08.01.2016 16:30, Ken Gaillot wrote:
> On 01/08/2016 08:56 AM, mr at inwx.de wrote:
>> Hello List,
>> I have a test environment here for evaluating Pacemaker. Sometimes our
>> KVM hosts with libvirt have trouble responding to the stonith/libvirt
>> resource, so I would like to configure the resource to be treated as
>> failed only after three failed monitoring attempts. I searched for a
>> suitable configuration for hours, but without success.
>> That's the configuration line for stonith/libvirt:
>> crm configure primitive p_fence_ha3 stonith:external/libvirt params
>> hostlist="ha3" hypervisor_uri="qemu+tls://debian1/system" op monitor
>> interval="60"
>> Every 60 seconds Pacemaker runs something like this:
>> stonith -t external/libvirt hostlist="ha3"
>> hypervisor_uri="qemu+tls://debian1/system" -S
>> To simulate the unavailability of the KVM host I remove the certificate
>> entry in /etc/libvirt/libvirtd.conf and restart libvirtd. After 60 seconds
>> or less I can see the error with "crm status". On the KVM host I then add
>> the certificate entry back to /etc/libvirt/libvirtd.conf and restart
>> libvirtd. Although libvirt is available again, the stonith resource did
>> not start again.
>> I altered the configuration line for stonith/libvirt with the following
>> additions:
>> op monitor interval="60" pcmk_status_retries="3"
>> op monitor interval="60" pcmk_monitor_retries="3"
>> op monitor interval="60" start-delay=180
>> meta migration-threshold="200" failure-timeout="120"
>> But in every case, after the first failed monitor check (within 60 seconds
>> or less), Pacemaker did not resume the stonith/libvirt resource once
>> libvirt was available again.
> Is there enough time left in the timeout for the cluster to retry? (The
> interval is not the same as the timeout.) Check your pacemaker.log for
> messages like "Attempted to execute agent ... the maximum number of
> times (...) allowed". That will tell you whether it is retrying.
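As far as I understand, pcmk_monitor_retries is a fence device parameter
rather than an operation attribute, so a rough, untested sketch would set it
under params and give the monitor operation a timeout long enough for the
retries; the concrete values here are only assumptions:

crm configure primitive p_fence_ha3 stonith:external/libvirt \
    params hostlist="ha3" hypervisor_uri="qemu+tls://debian1/system" \
    pcmk_monitor_retries="3" \
    op monitor interval="60s" timeout="120s"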
> You definitely don't want start-delay, and migration-threshold doesn't
> really mean much for fence devices.
> Of course, you also want to fix the underlying problem of libvirt not
> being responsive. That doesn't sound like something that should
> routinely happen.
> BTW I haven't used stonith/external agents (which rely on the
> cluster-glue package) myself. I use the fence_virtd daemon on the host
> with fence_xvm as the configured fence agent.
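If someone wants to go that route, a rough sketch of such a resource could
look like the following (untested here; fence_virtd would have to be set up
on the KVM host first, and the resource name and values are only
placeholders):

crm configure primitive p_fence_ha3_xvm stonith:fence_xvm \
    params pcmk_host_list="ha3" \
    op monitor interval="9600s" timeout="60s"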
>> Here is the "crm status"-output on debian 8 (Jessie):
>> root@ha4:~# crm status
>> Last updated: Tue Jan 5 10:04:18 2016
>> Last change: Mon Jan 4 18:18:12 2016
>> Stack: corosync
>> Current DC: ha3 (167772400) - partition with quorum
>> Version: 1.1.12-561c4cf
>> 2 Nodes configured
>> 2 Resources configured
>> Online: [ ha3 ha4 ]
>> Service-IP (ocf::heartbeat:IPaddr2): Started ha3
>> haproxy (lsb:haproxy): Started ha3
>> p_fence_ha3 (stonith:external/libvirt): Started ha4
>> Kind regards
>> Michael R.
> Users mailing list: Users at clusterlabs.org
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org