[Pacemaker] A caveat in the VirtualDomain resource agent

Dejan Muhamedagic dejanmm at fastmail.fm
Fri Aug 22 13:32:40 UTC 2014


Hi,

On Fri, Aug 22, 2014 at 10:23:29AM +0200, Cédric Dufour - Idiap Research Institute wrote:
> Hello,
> 
> Is this the right place to report this issue? (please redirect me if not)

Yes. Though bugs/issues/fixes are nowadays mostly handled at
github.com/ClusterLabs/resource-agents, and reports there
certainly get more visibility.

> While experimenting with and demonstrating our new cluster yesterday, we stumbled on a caveat in our LibvirtQemu resource agent (derived from VirtualDomain). Since the same caveat exists in the VirtualDomain resource agent, I thought I had better report it. Please see the patch below (for LibvirtQemu), whose comments should allow you to understand where the problem lies.

Perhaps I missed something, but may I ask why you decided to
create a new RA instead of improving the existing one? Was there
anything in VirtualDomain that made it unsuitable for your use
case?

> --- LibvirtQemu.orig    2014-08-22 09:39:21.997201000 +0200
> +++ LibvirtQemu    2014-08-22 09:50:32.440969000 +0200
> @@ -154,11 +154,10 @@
>    local virsh_output
>    local domain_name
>  
> -  # Note: passing in the domain name from outside the script is
> -  # intended for testing and debugging purposes only. Don't do this
> -  # in production, instead let the script figure out the domain name
> -  # from the config file. You have been warned.
> -  if [ -z "${DOMAIN_NAME}" ]; then
> +  # NOTE: Re-defining an already defined domain is dangerous! It shall be done only
> +  # if we can reasonably assume the configuration file hasn't changed since the last
> +  # time the domain has been defined.
> +  if [ -z "${DOMAIN_NAME}" ] || [ "${OCF_RESKEY_config}" -ot "${STATEFILE}" ]; then
>      # Spin until we have a domain name
>      while true; do
>        virsh_output="$(virsh ${VIRSH_OPTIONS} define ${OCF_RESKEY_config})"
> @@ -170,7 +169,7 @@
>      echo "${domain_name}" > "${STATEFILE}"
>      ocf_log info "Domain name '${domain_name}' saved to state file '${STATEFILE}'."
>    else
> -    ocf_log warn "Domain name '${DOMAIN_NAME}' already defined; overriding configuration file '${OCF_RESKEY_config}' (this should NOT ne done in production!)."
> +    ocf_log warn "Domain name '${DOMAIN_NAME}' already defined; overriding by newer configuration file will NOT be done!"
>    fi
>  }
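The two timestamp tests the patch relies on (`-nt` at startup, `-ot` inside the define function) can be sketched in isolation as follows. This is only an illustration, not the RA's code: the helper names `needs_define` and `config_unchanged` are hypothetical, and the explicit `-e` checks are an assumption added here so that the result does not depend on how a given shell treats a missing file.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the timestamp guards; not part of
# VirtualDomain or LibvirtQemu.

# True (0) when the domain should be defined at startup: either there is
# no state file yet, or the config was modified after the state file.
needs_define() {
  [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}

# True (0) only when a state file exists and the config predates it,
# i.e. we can reasonably assume the config has not changed since the
# domain was last defined.
config_unchanged() {
  [ -e "$2" ] && [ "$1" -ot "$2" ]
}
```

Usage would be along the lines of `needs_define "${OCF_RESKEY_config}" "${STATEFILE}" && LibvirtQemu_Define`. Note that when both files carry the same modification time, neither test is true.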

Under which circumstances did you run into these issues?
There were some recent additions which enable saving the changes
back to the configuration file. Would that help?

Cheers,

Dejan

> @@ -205,12 +204,12 @@
>          ;;
>        ''|'no state')
>          # Empty string may be returned when virsh does not
> -        # receive a reply from libvirtd.
> +        # receive a reply from libvirtd or after the domain has
> +        # been undefined.
>          # "no state" may occur when the domain is currently
>          # being migrated (on the migration target only), or
>          # whenever virsh can't reliably obtain the domain
>          # state.
> -        status='no state'
>          if [ "${__OCF_ACTION}" == 'stop' ] && [ ${try} -ge 3 ]; then
>            # During the stop operation, we want to bail out
>            # quickly, so as to be able to force-stop (destroy)
> @@ -224,6 +223,17 @@
>            ocf_log info "Domain '${DOMAIN_NAME}' currently has no state; retrying."
>            sleep 1
>          fi
> +        if [ "${status}" == '' ] && [ $(( ${try} % 10 )) -eq 0 ]; then
> +          # Could it be that libvirtd is running healthily but the domain
> +          # has been undefined? In that case, let's attempt to re-define it.
> +          # If libvirtd IS running, it can not hurt (given the safeguards in
> +          # LibvirtQemu_Define). If libvirtd is NOT running, then something is
> +          # definitely wrong (and the monitor operation will time-out in
> +          # LibvirtQemu_Define the same way as it would here).
> +          ocf_log warn "Has domain '${DOMAIN_NAME}' been undefined? attempting to re-define it."
> +          LibvirtQemu_Define
> +        fi
> +        status='no state'
>          ;;
>        *)
>          # any other output is unexpected.
> @@ -487,6 +497,11 @@
>  
>  # Define the domain on startup, and re-define whenever someone deleted
>  # the state file, or touched the config.
> +# WARNING: There is a caveat here! When the resource is stopped, the state file
> +# is deleted ONLY on the node where it was running. In case the domain is then
> +# undefined (from libvirtd), on all nodes, we will end-up with a state file but no
> +# domain definition on those nodes that were not running the resource. The monitor
> +# operation MUST handle that situation, should the resource be restarted.
>  if [ ! -e "${STATEFILE}" ] || [ "${OCF_RESKEY_config}" -nt "${STATEFILE}" ]; then
>    LibvirtQemu_Define
>  fi
> 
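The situation described in the WARNING comment (a state file is present on a node, but libvirt no longer knows the domain) could be detected along these lines. This is a simplified sketch and not what the patch does: the patch re-runs LibvirtQemu_Define on every tenth monitor retry, whereas the hypothetical `redefine_if_stale` below checks once. The `VIRSH` override is likewise an assumption, added so the logic can be exercised without libvirt.

```shell
#!/bin/sh
# Hypothetical sketch; VIRSH may be overridden for testing.
VIRSH="${VIRSH:-virsh}"

# True (0) when libvirt can report a state for the domain. A non-zero
# result may also mean libvirtd is unreachable, not only "undefined".
domain_is_defined() {
  ${VIRSH} domstate "$1" >/dev/null 2>&1
}

# Arguments: domain name, config file, state file.
# Re-define the domain when the state file claims it is defined but
# libvirt disagrees. Returns non-zero if the re-define attempt fails.
redefine_if_stale() {
  if [ -e "$3" ] && ! domain_is_defined "$1"; then
    ${VIRSH} define "$2" >/dev/null || return 1
    # Rewrite the state file so its timestamp follows the new definition.
    echo "$1" > "$3"
  fi
}
```

If libvirtd itself is down, the `define` call fails too and the function returns non-zero, so the caller still sees an error rather than a silent success.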
> One could ask "why undefine a libvirt domain and then restart it?". The answer is twofold: 1. experience showed us that we should undefine a decommissioned domain from libvirt to prevent potential UUID conflicts when defining a new domain (which is likely in our setup, since UUIDs are built from the domain IP address); 2. the "demo effect" (or potentially legitimate reasons), where one would "decommission" a domain and restart it right afterwards ( :-/ ).
> 
> PS: we now also make sure to delete the VirtualDomain/LibvirtQemu state file when undefining the domain. But it is best to have multiple safeguards as far as this caveat is concerned (hence the patch above).
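The extra safeguard mentioned in the PS might look roughly like this. It is a sketch under assumptions: the function name `undefine_domain` and the `VIRSH` override are hypothetical, and since the state file lives on each node, something like this would presumably have to run on every node where the domain was defined.

```shell
#!/bin/sh
# Hypothetical sketch; VIRSH may be overridden for testing.
VIRSH="${VIRSH:-virsh}"

# Arguments: domain name, state file.
# Undefine the domain and, only if that succeeded, remove the state
# file so that the next start re-defines the domain from its config.
undefine_domain() {
  ${VIRSH} undefine "$1" || return 1
  rm -f "$2"
}
```

Keeping the state file when `virsh undefine` fails is a deliberate choice here: the file then still matches what libvirt knows.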
> 
> Hope it helps,
> 
> Cédric
> 
> -- 
> 
> Cédric Dufour @ Idiap Research Institute
> 





