[ClusterLabs] Automatic Recovery for stonith:external/libvirt
mr at inwx.de
Tue Jan 12 07:06:27 EST 2016
Thanks for the reply. After further unsuccessful testing of the automatic
recovery I read an article on the topic. It recommends monitoring the
stonith resource only once every few hours.
I am happy with that, so I configured the monitoring interval to
9600 secs (about 2 hours 40 minutes).
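For reference, a minimal sketch of how that resource definition might look
in the crm shell, reusing the hostlist and hypervisor_uri from the quoted
mail below (the timeout value is only an assumption on my side):

crm configure primitive p_fence_ha3 stonith:external/libvirt \
    params hostlist="ha3" hypervisor_uri="qemu+tls://debian1/system" \
    op monitor interval="9600s" timeout="60s"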
On 08.01.2016 16:30, Ken Gaillot wrote:
> On 01/08/2016 08:56 AM, mr at inwx.de wrote:
>> Hello List,
>> I have a test environment here for evaluating Pacemaker. Sometimes our
>> KVM hosts with libvirt have trouble responding to the stonith/libvirt
>> resource, so I would like to configure the resource to be treated as
>> failed only after three failed monitoring attempts. I searched for a
>> suitable configuration for hours, but without success.
>> That's the configuration line for stonith/libvirt:
>> crm configure primitive p_fence_ha3 stonith:external/libvirt params
>> hostlist="ha3" hypervisor_uri="qemu+tls://debian1/system" op monitor
>> interval="60"
>> Every 60 seconds Pacemaker runs something like this:
>> stonith -t external/libvirt hostlist="ha3"
>> hypervisor_uri="qemu+tls://debian1/system" -S
>> To simulate the unavailability of the KVM host I remove the certificate
>> entry in /etc/libvirt/libvirtd.conf and restart libvirtd. After 60 seconds
>> or less I can see the error with "crm status". On the KVM host I then add
>> the certificate entry back to /etc/libvirt/libvirtd.conf and restart
>> libvirtd. Although libvirt is available again, the stonith resource did
>> not start again.
>> I altered the configuration line for stonith/libvirt with the following
>> additions:
>> op monitor interval="60" pcmk_status_retries="3"
>> op monitor interval="60" pcmk_monitor_retries="3"
>> op monitor interval="60" start-delay=180
>> meta migration-threshold="200" failure-timeout="120"
>> But in every case, after the first failed monitor check (within 60 seconds
>> or less), Pacemaker did not resume the stonith/libvirt resource once
>> libvirt was available again.
> Is there enough time left in the timeout for the cluster to retry? (The
> interval is not the same as the timeout.) Check your pacemaker.log for
> messages like "Attempted to execute agent ... the maximum number of
> times (...) allowed". That will tell you whether it is retrying.
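As far as I understand, pcmk_monitor_retries is a fence device parameter
rather than an operation attribute, so a rough, untested sketch would set it
under params and give the monitor operation a timeout long enough for the
retries; the concrete values here are only assumptions:

crm configure primitive p_fence_ha3 stonith:external/libvirt \
    params hostlist="ha3" hypervisor_uri="qemu+tls://debian1/system" \
    pcmk_monitor_retries="3" \
    op monitor interval="60s" timeout="120s"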
> You definitely don't want start-delay, and migration-threshold doesn't
> really mean much for fence devices.
> Of course, you also want to fix the underlying problem of libvirt not
> being responsive. That doesn't sound like something that should
> routinely happen.
> BTW I haven't used stonith/external agents (which rely on the
> cluster-glue package) myself. I use the fence_virtd daemon on the host
> with fence_xvm as the configured fence agent.
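If someone wants to go that route, a rough sketch of such a resource could
look like the following (untested here; fence_virtd would have to be set up
on the KVM host first, and the resource name and values are only
placeholders):

crm configure primitive p_fence_ha3_xvm stonith:fence_xvm \
    params pcmk_host_list="ha3" \
    op monitor interval="9600s" timeout="60s"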
>> Here is the "crm status"-output on debian 8 (Jessie):
>> root@ha4:~# crm status
>> Last updated: Tue Jan 5 10:04:18 2016
>> Last change: Mon Jan 4 18:18:12 2016
>> Stack: corosync
>> Current DC: ha3 (167772400) - partition with quorum
>> Version: 1.1.12-561c4cf
>> 2 Nodes configured
>> 2 Resources configured
>> Online: [ ha3 ha4 ]
>> Service-IP (ocf::heartbeat:IPaddr2): Started ha3
>> haproxy (lsb:haproxy): Started ha3
>> p_fence_ha3 (stonith:external/libvirt): Started ha4
>> Kind regards
>> Michael R.
> Users mailing list: Users at clusterlabs.org
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org