[ClusterLabs] Automatic Recovery for stonith:external/libvirt

Ken Gaillot kgaillot at redhat.com
Fri Jan 8 10:30:36 EST 2016


On 01/08/2016 08:56 AM, mr at inwx.de wrote:
> Hello List,
> 
> I have a test environment here for evaluating Pacemaker. Sometimes our
> KVM hosts running libvirt have trouble responding to the stonith/libvirt
> resource, so I would like to configure the resource to be treated as
> failed only after three failed monitor attempts. I searched for a
> suitable configuration here:
> 
> 
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html
> 
> 
> But after hours of trying I had no success.
> 
> That's the configuration line for stonith/libvirt:
> 
> crm configure primitive p_fence_ha3 stonith:external/libvirt \
>     params hostlist="ha3" hypervisor_uri="qemu+tls://debian1/system" \
>     op monitor interval="60"
> 
> Every 60 seconds Pacemaker runs something like this:
> 
>  stonith -t external/libvirt hostlist="ha3" \
>      hypervisor_uri="qemu+tls://debian1/system" -S
>  ok
> 
> To simulate the KVM host being unavailable, I remove the certificate
> from /etc/libvirt/libvirtd.conf and restart libvirtd. Within 60 seconds
> or less I can see the error with "crm status". Then I add the
> certificate back to /etc/libvirt/libvirtd.conf on the KVM host and
> restart libvirtd. Although libvirt is available again, the stonith
> resource does not start again.
> 
> I altered the configuration line for stonith/libvirt with the
> following variations:
> 
>  op monitor interval="60" pcmk_status_retries="3"
>  op monitor interval="60" pcmk_monitor_retries="3"
>  op monitor interval="60" start-delay=180
>  meta migration-threshold="200" failure-timeout="120"
> 
> But in every case, after the first failed monitor check (within 60
> seconds or less), Pacemaker did not resume the stonith/libvirt
> resource once libvirt was available again.

Is there enough time left in the timeout for the cluster to retry? (The
interval is not the same as the timeout.) Check your pacemaker.log for
messages like "Attempted to execute agent ... the maximum number of
times (...) allowed". That will tell you whether it is retrying.
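
For what it's worth, pcmk_monitor_retries is a device parameter rather
than an operation attribute, so it belongs under "params", not on the
monitor op. If I have the crm syntax right, something like the sketch
below is closer to what you want: retries on the device, plus a monitor
timeout long enough for three attempts to actually fit. Treat the
values as examples to tune, not as tested ones:

  crm configure primitive p_fence_ha3 stonith:external/libvirt \
    params hostlist="ha3" hypervisor_uri="qemu+tls://debian1/system" \
      pcmk_monitor_retries="3" \
    op monitor interval="60s" timeout="120s"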

You definitely don't want start-delay, and migration-threshold doesn't
really mean much for fence devices.
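
One more thing: the failure-timeout you tried is only acted on when the
cluster rechecks its state, which by default happens every 15 minutes,
so a short failure-timeout can look like it is being ignored. Roughly
(untested):

  # have expired failures noticed sooner (the default recheck is 15 minutes)
  crm configure property cluster-recheck-interval="2min"

  # or clear the failure by hand so the device can start again right away
  crm resource cleanup p_fence_ha3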

Of course, you also want to fix the underlying problem of libvirt not
being responsive. That doesn't sound like something that should
routinely happen.

BTW I haven't used stonith/external agents (which rely on the
cluster-glue package) myself. I use the fence_virtd daemon on the host
with fence_xvm as the configured fence agent.
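
If you want to try that route, the guest-side resource is roughly the
following. This is only a sketch; the port (the libvirt domain name)
and the host list are assumptions that have to match how fence_virtd
is configured on the KVM host:

  crm configure primitive p_fence_ha3 stonith:fence_xvm \
    params port="ha3" pcmk_host_list="ha3" \
    op monitor interval="60s"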

> Here is the "crm status" output on Debian 8 (Jessie):
> 
>  root at ha4:~# crm status
>  Last updated: Tue Jan  5 10:04:18 2016
>  Last change: Mon Jan  4 18:18:12 2016
>  Stack: corosync
>  Current DC: ha3 (167772400) - partition with quorum
>  Version: 1.1.12-561c4cf
>  2 Nodes configured
>  2 Resources configured
>  Online: [ ha3 ha4 ]
>  Service-IP     (ocf::heartbeat:IPaddr2):       Started ha3
>  haproxy        (lsb:haproxy):  Started ha3
>  p_fence_ha3    (stonith:external/libvirt):     Started ha4
> 
> Kind regards
> 
> Michael R.




