[Pacemaker] VirtualDomain Shutdown Timeout

Sat Mar 24 15:27:02 EDT 2012

Hello, 

I have configured a KVM virtual machine primitive using Pacemaker 1.1.6 and Heartbeat 3.0.5 on Ubuntu 10.04 Server, with DRBD as the storage device (so there is no shared storage and no live migration):

primitive p_vm ocf:heartbeat:VirtualDomain \
    params config="/vmstore/config/vm.xml" \
    meta allow-migrate="false" \
    op start interval="0" timeout="180s" \
    op stop interval="0" timeout="120s" \
    op monitor interval="10" timeout="30"

I would expect the following events to happen during failover on the "from" node (the node the VM is moving away from) if the VM hangs while shutting down:
1. VirtualDomain issues "virsh shutdown vm" to shut the VM down gracefully.
2. Pacemaker waits up to 120 seconds, the timeout specified in the "op stop" line.
3. VirtualDomain waits a bit less than 120 seconds to see whether the VM shuts down gracefully. Once it gets to almost 120 seconds, it issues "virsh destroy vm" to hard stop the VM.
4. Pacemaker sees, before its 120-second timeout expires, that the VM has stopped, and proceeds with the failover.
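
To make the expected sequence concrete, here is a minimal sketch of the logic I have in mind. It is not the actual resource agent code: the function name stop_with_deadline and the default margin of 5 seconds are assumptions for illustration, and it relies only on "virsh shutdown", "virsh domstate" and "virsh destroy".

```shell
#!/bin/sh
# Hypothetical sketch, not the real VirtualDomain agent:
# request a graceful shutdown, poll until shortly before the stop
# timeout expires, then hard stop the domain if it is still running.
stop_with_deadline() {
    domain=$1
    stop_timeout=$2      # seconds, as configured in "op stop timeout"
    margin=${3:-5}       # assumed safety margin, in seconds
    deadline=$(( $(date +%s) + stop_timeout - margin ))

    # Step 1: ask the guest to shut down gracefully
    virsh shutdown "$domain"

    # Step 3: wait until just before the stop timeout
    while [ "$(date +%s)" -lt "$deadline" ]; do
        # "virsh domstate" reports "shut off" once the domain has stopped
        if [ "$(virsh domstate "$domain" 2>/dev/null)" = "shut off" ]; then
            return 0
        fi
        sleep 1
    done

    # Still running close to the deadline: hard stop
    virsh destroy "$domain"
}
```

With my configuration this would be called as "stop_with_deadline vm 120", so the hard stop would happen after roughly 115 seconds, before Pacemaker gives up on the stop operation.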

However, I observed that VirtualDomain seems to use the timeout from the "op start" line (180 seconds), while Pacemaker uses the 120-second timeout from the "op stop" line. As a result, the VM is still running when the Pacemaker timeout is reached, and the node is STONITHed.

Here is the relevant section of code from /usr/lib/ocf/resource.d/heartbeat/VirtualDomain:

VirtualDomain_Stop() {
    local i
    local status
    local shutdown_timeout
    local out ex

    VirtualDomain_Status
    status=$?

    case $status in
        $OCF_SUCCESS)
            if ! ocf_is_true $OCF_RESKEY_force_stop; then
                # Issue a graceful shutdown request
                ocf_log info "Issuing graceful shutdown request for domain ${DOMAIN_NAME}."
                virsh $VIRSH_OPTIONS shutdown ${DOMAIN_NAME}
                # The "shutdown_timeout" we use here is the operation
                # timeout specified in the CIB, minus 5 seconds
                shutdown_timeout=$(( $NOW + ($OCF_RESKEY_CRM_meta_timeout/1000) -5 ))
                # Loop on status until we reach $shutdown_timeout
                while [ $NOW -lt $shutdown_timeout ]; do

Doesn't $OCF_RESKEY_CRM_meta_timeout correspond to the timeout value in the "op stop ..." line? 
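
To spell out the arithmetic behind my question: the excerpt divides $OCF_RESKEY_CRM_meta_timeout by 1000, so the value is in milliseconds, and then subtracts 5 seconds. The small sketch below applies the same calculation to both of my configured timeouts; the helper name grace_seconds is made up for illustration and is not part of the agent.

```shell
#!/bin/sh
# Hypothetical helper: how many seconds the agent would wait for a
# graceful shutdown, given the operation timeout in milliseconds.
# Same arithmetic as the excerpt: timeout/1000 - 5.
grace_seconds() {
    echo $(( $1 / 1000 - 5 ))
}

grace_seconds 120000   # timeout from "op stop"  (120s): prints 115
grace_seconds 180000   # timeout from "op start" (180s): prints 175
```

If the agent were given the "op stop" timeout, it would wait 115 seconds, which fits inside Pacemaker's 120-second limit. A wait of 175 seconds, which is what the 180-second value would produce, overruns that limit, and that matches the behaviour I am seeing.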

How can I adjust my Pacemaker configuration so that the VM attempts a graceful shutdown and, in the worst case, is destroyed before the Pacemaker timeout is reached? Also, is there anything I can do inside the VM (another Ubuntu 10.04 install) to speed up the shutdown process?

Thanks, 

Andrew 