[ClusterLabs] Failed VM (libvirt) stop

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Fri Aug 6 09:24:33 EDT 2021


This is not strictly a cluster question, but a resource agent question:
I had a case where a Xen PVM could not be stopped while it was in either the GRUB or the early boot phase.
I noticed that the VM would not stop while I was connected to the (text) console, so I inspected "xentop":
There the VM has "s"-state (shutdown).
The console output was:
Loading Linux 5.3.18-59.16-default ...
Loading initial ramdisk ...
[    2.038092] Cannot find an available gap in the 32-bit address range
[    2.038094] PCI devices with unassigned 32-bit BARs may not work!
[    2.490713] reboot: Power down

Local log messages were:
Aug 06 08:25:03 h19 VirtualDomain(prm_xen_v01)[10468]: INFO: Issuing graceful shutdown request for domain v01.
Aug 06 08:25:28 h19 kernel: xen-blkback: backend/vbd/25/51744: prepare for reconnect
Aug 06 08:25:28 h19 kernel: xen-blkback: backend/vbd/25/51760: prepare for reconnect
Aug 06 08:30:03 h19 pacemaker-execd[11667]:  warning: prm_xen_v01_stop_0 process (PID 10435) timed out
Aug 06 08:30:03 h19 pacemaker-execd[11667]:  warning: prm_xen_v01_stop_0[10435] timed out after 300000ms
Aug 06 08:30:03 h19 pacemaker-execd[11667]:  notice: prm_xen_v01 stop (call 337, PID 10435) exited with status 1 (execution time 300007ms, queue time 0ms)
Aug 06 08:30:03 h19 pacemaker-controld[11670]:  error: Result of stop operation for prm_xen_v01 on h19: Timed Out
Aug 06 08:30:03 h19 libvirtd[13675]: End of file while reading data: Input/output error

Is that a problem in Xen, libvirt or the RA?
Specifically I'm missing a forced shutdown (like "m destroy") before the stop timed out.

The RA doc says: "The default behavior is to resort to a forceful shutdown only after a graceful
shutdown attempt has failed."

Browsing the RA, I suspect that when either "virsh shutdown" is waiting for completion or VirtualDomain_status is hanging, then the "timeout loop" (after which force_stop will be called) does not finish before the cluster times out the operation.
The timeout code (shutdown_timeout=$(( $NOW + ($OCF_RESKEY_CRM_meta_timeout/1000) -5 ))) allows 5 extra seconds from the start of the RA (where NOW is set) for all the processing.
So if you spend 2 seconds until the while loop starts, and you spend three more extra seconds while waiting for 5 minutes (300s), the cluster will time out the stop before the RA makes its final attempt.
That might be a little tight IMHO.
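
To put numbers on this, here is a minimal sketch of the arithmetic. Only the shutdown_timeout expression is taken from the RA; NOW=0, the cluster_deadline variable and the force_stop_in_time helper are made up for illustration and do not exist in VirtualDomain.

```shell
#!/bin/sh
# Worked numbers for the deadline formula quoted above.
# Only the shutdown_timeout expression comes from the RA; the rest is assumed.

OCF_RESKEY_CRM_meta_timeout=300000   # stop timeout in ms (300 s, as in the log)
NOW=0                                # RA start time; 0 keeps the numbers readable

# Same arithmetic as in the RA: absolute deadline for the graceful shutdown.
shutdown_timeout=$(( NOW + (OCF_RESKEY_CRM_meta_timeout / 1000) - 5 ))

# Point at which the cluster gives up on the stop operation.
cluster_deadline=$(( NOW + OCF_RESKEY_CRM_meta_timeout / 1000 ))

# Time left for the forced stop once the graceful wait is over.
margin=$(( cluster_deadline - shutdown_timeout ))

# Hypothetical helper: succeeds if a forced stop would still start before
# the cluster timeout when the RA loses $1 seconds to overhead or to a
# blocking call (for example a hanging status check).
force_stop_in_time() {
    [ $(( shutdown_timeout + $1 )) -lt "$cluster_deadline" ]
}

echo "graceful deadline: ${shutdown_timeout}s, cluster deadline: ${cluster_deadline}s, margin: ${margin}s"
for lost in 2 5; do
    if force_stop_in_time "$lost"; then
        echo "lost ${lost}s: forced stop can still start"
    else
        echo "lost ${lost}s: cluster times out first"
    fi
done
```

With a 300 s stop timeout the margin is 5 seconds, so in this model any delay adding up to 5 seconds or more means the cluster times out the stop before the forced stop is even attempted.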

In contrast the older Xen RA uses 1/3rd of the timeout as safety margin:

Any splendid insights?

