[Pacemaker] VirtualDomain Shutdown Timeout

Thu Mar 29 02:08:21 EDT 2012

On Sun, Mar 25, 2012 at 6:27 AM, Andrew Martin <amartin at xes-inc.com> wrote:
> Hello,
>
> I have configured a KVM virtual machine primitive using Pacemaker 1.1.6 and
> Heartbeat 3.0.5 on Ubuntu 10.04 Server using DRBD as the storage device (so
> there is no shared storage, no live-migration):
> primitive p_vm ocf:heartbeat:VirtualDomain \
>         params config="/vmstore/config/vm.xml" \
>         meta allow-migrate="false" \
>         op start interval="0" timeout="180s" \
>         op stop interval="0" timeout="120s" \
>         op monitor interval="10" timeout="30"
>
> I would expect the following events to happen on failover on the "from" node
> (the migration source) if the VM hangs while shutting down:
> 1. VirtualDomain issues "virsh shutdown vm" to gracefully shut down the VM
> 2. pacemaker waits up to 120 seconds, the timeout specified in the "op stop"
> line
> 3. VirtualDomain waits a bit less than 120 seconds to see if the VM will
> shut down gracefully. Once it gets to almost 120 seconds, it issues "virsh
> destroy vm" to hard stop the VM.
> 4. pacemaker wakes up from the 120 second timeout and sees that the VM has
> stopped and proceeds with the failover
>
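The sequence described above can be sketched roughly as follows. This is a minimal illustration only, not the VirtualDomain agent's code: request_shutdown, force_destroy and domain_is_running are hypothetical stand-ins for the virsh calls, and the margin argument is an assumption added so the example can be tried with short timeouts.

```shell
#!/bin/sh
# Hypothetical stubs. A real agent would call "virsh shutdown",
# "virsh destroy" and a domain status check here instead.
request_shutdown()  { echo "shutdown requested"; }
force_destroy()     { echo "destroy issued"; VM_RUNNING=0; }
domain_is_running() { [ "${VM_RUNNING:-1}" -eq 1 ]; }

# stop_domain TIMEOUT_MS [MARGIN_S]
# Ask for a graceful shutdown, poll until shortly before the operation
# timeout, then fall back to a hard stop if the domain is still running.
stop_domain() {
    sd_timeout_ms=$1
    sd_margin_s=${2:-5}
    sd_deadline=$(( $(date +%s) + sd_timeout_ms / 1000 - sd_margin_s ))

    request_shutdown
    while domain_is_running; do
        if [ "$(date +%s)" -ge "$sd_deadline" ]; then
            # Graceful shutdown did not finish in time: hard stop.
            force_destroy
            return 0
        fi
        sleep 1
    done
    return 0
}
```

With a 120 second stop timeout this would be called as stop_domain 120000, leaving about 5 seconds for the hard stop before pacemaker gives up on the operation.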
> However, I observed that VirtualDomain seems to be using the timeout from
> the "op start" line, 180 seconds, yet pacemaker uses the 120 second timeout.
> Thus, the VM is still running after the pacemaker timeout is reached and so
> the node is STONITHed. Here is the relevant section of code from
> /usr/lib/ocf/resource.d/heartbeat/VirtualDomain:
> VirtualDomain_Stop() {
>     local i
>     local status
>     local shutdown_timeout
>     local out ex
>
>     VirtualDomain_Status
>     status=$?
>
>     case $status in
>         $OCF_SUCCESS)
>             if ! ocf_is_true $OCF_RESKEY_force_stop; then
>                 # Issue a graceful shutdown request
>                 ocf_log info "Issuing graceful shutdown request for domain ${DOMAIN_NAME}."
>                 virsh $VIRSH_OPTIONS shutdown ${DOMAIN_NAME}
>                 # The "shutdown_timeout" we use here is the operation
>                 # timeout specified in the CIB, minus 5 seconds
>                 shutdown_timeout=$(( $NOW + ($OCF_RESKEY_CRM_meta_timeout/1000) -5 ))
>                 # Loop on status until we reach $shutdown_timeout
>                 while [ $NOW -lt $shutdown_timeout ]; do
>
> Doesn't $OCF_RESKEY_CRM_meta_timeout correspond to the timeout value in the
> "op stop ..." line?

It should; however, there was a bug in 1.1.6 where this wasn't the case.
The relevant patch is:
  https://github.com/beekhof/pacemaker/commit/fcfe6fe

Alternatively, you could try 1.1.7.
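
To illustrate why the wrong value matters, here is the arithmetic from the quoted snippet as a small shell sketch. It assumes, based on the behaviour reported above, that the agent was handed the start timeout (180000 ms) rather than the stop timeout (120000 ms). The function names are made up for illustration and do not exist in the agent.

```shell
#!/bin/sh
# Seconds the quoted agent code polls before giving up on a graceful
# shutdown: the operation timeout (milliseconds) in seconds, minus 5.
graceful_wait_s() {
    echo $(( $1 / 1000 - 5 ))
}

# Succeeds if the agent would stop waiting before pacemaker's own
# stop timeout (given in seconds) expires.
finishes_in_time() {
    [ "$(graceful_wait_s "$1")" -lt "$2" ]
}

graceful_wait_s 120000   # prints 115
graceful_wait_s 180000   # prints 175

finishes_in_time 120000 120 && echo "stop timeout respected"
finishes_in_time 180000 120 || echo "pacemaker times out first"
```

With the intended value the agent gives up after 115 seconds, inside pacemaker's 120 second window. With the 180 second value it would keep polling until 175 seconds, so pacemaker's stop operation times out first and the node is fenced.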

>
> How can I optimize my pacemaker configuration so that the VM will attempt to
> shut down gracefully and then, in the worst case, be destroyed before the
> pacemaker timeout is reached? Moreover, is there anything I can do inside of
> the VM (another Ubuntu 10.04 install) to optimize/speed up the shutdown
> process?
>
> Thanks,
>
> Andrew