[ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?
Lentes, Bernd
bernd.lentes at helmholtz-muenchen.de
Fri Oct 23 17:16:57 EDT 2020
----- On Oct 23, 2020, at 8:45 PM, Valentin Vidić vvidic at valentin-vidic.from.hr wrote:
> On Fri, Oct 23, 2020 at 08:08:31PM +0200, Lentes, Bernd wrote:
>> But when the timeout has run out the RA tries to kill the machine with a "virsh
>> destroy".
>> And if that does not work (which is occasionally my problem) because the domain
>> is in uninterruptible sleep (D state), the RA returns $OCF_ERR_GENERIC, which
>> causes pacemaker to fence the lazy node. Or am I wrong?
>
> What does the log look like when this happens?
>
/var/log/cluster/corosync.log:
VirtualDomain(vm_amok)[8998]: 2020/09/27_22:34:11 INFO: Issuing graceful shutdown request for domain vm_amok.
VirtualDomain(vm_amok)[8998]: 2020/09/27_22:37:06 INFO: Issuing forced shutdown (destroy) request for domain vm_amok.
Sep 27 22:37:11 [11282] ha-idg-2 lrmd: warning: child_timeout_callback: vm_amok_stop_0 process (PID 8998) timed out
Sep 27 22:37:11 [11282] ha-idg-2 lrmd: warning: operation_finished: vm_amok_stop_0:8998 - timed out after 180000ms
The stop timeout of the domain is 180 sec.
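To make the timing in the log above explicit, here is a small Python sketch that computes how much of the 180 second stop timeout was left when the RA issued the destroy request. The timestamps are copied from the corosync.log excerpt; the variable names are only illustrative.

```python
from datetime import datetime

FMT = "%Y/%m/%d_%H:%M:%S"  # timestamp format used in the RA log lines

# Timestamps copied from the corosync.log excerpt
graceful = datetime.strptime("2020/09/27_22:34:11", FMT)  # graceful shutdown request
destroy = datetime.strptime("2020/09/27_22:37:06", FMT)   # forced shutdown (destroy) request
stop_timeout = 180  # seconds, from "timed out after 180000ms"

elapsed = int((destroy - graceful).total_seconds())
remaining = stop_timeout - elapsed

print(f"destroy issued after {elapsed}s, {remaining}s left before the stop timeout")
```

So the graceful shutdown used up 175 of the 180 seconds, and the forced shutdown had only 5 seconds to complete before lrmd timed out the stop operation.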
/var/log/libvirt/libvirtd.log (time is UTC):
2020-09-27 20:37:21.489+0000: 18583: error : virProcessKillPainfully:401 : Failed to terminate process 14037 with SIGKILL: Device or resource busy
2020-09-27 20:37:21.505+0000: 6610: error : virNetSocketWriteWire:1852 : Cannot write data: Broken pipe
2020-09-27 20:37:31.962+0000: 6610: error : qemuMonitorIO:719 : internal error: End of file from qemu monitor
SIGKILL didn't work. Nevertheless, the process ended at 22:37:31 local time, about 20 seconds after the stop operation timed out (25 seconds after the destroy request), presumably because it woke up from D state and received the signal.
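Because libvirtd.log is in UTC and corosync.log is in local time, the two logs are easier to compare on one time axis. The following Python sketch does that; it assumes the cluster nodes log in UTC+2 (CEST), which is my reading of the timestamps and not something the logs state themselves. Fractional seconds are dropped.

```python
from datetime import datetime, timedelta, timezone

# Assumption: corosync.log uses local time UTC+2 (CEST)
LOCAL = timezone(timedelta(hours=2))

# lrmd timeout of vm_amok_stop_0, from corosync.log (local time)
lrmd_timeout = datetime(2020, 9, 27, 22, 37, 11, tzinfo=LOCAL)

# Entries from libvirtd.log (UTC)
sigkill_failed = datetime(2020, 9, 27, 20, 37, 21, tzinfo=timezone.utc)
monitor_eof = datetime(2020, 9, 27, 20, 37, 31, tzinfo=timezone.utc)

def seconds_after_timeout(event):
    """Seconds between the lrmd timeout and the given libvirt event."""
    return int((event - lrmd_timeout).total_seconds())

for name, event in (("SIGKILL failed", sigkill_failed), ("qemu monitor EOF", monitor_eof)):
    local = event.astimezone(LOCAL).strftime("%H:%M:%S")
    print(f"{name}: {local} local, {seconds_after_timeout(event)}s after the lrmd timeout")
```

Under that assumption, libvirt reported the failed SIGKILL 10 seconds after the stop operation had already timed out, and the qemu monitor closed another 10 seconds later.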
/var/log/cluster/corosync.log on the DC:
Sep 27 22:37:11 [3580] ha-idg-1 crmd: warning: status_from_rc: Action 93 (vm_amok_stop_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error
So the stop (including the SIGKILL) failed.
Sep 27 22:37:11 [3579] ha-idg-1 pengine: notice: native_stop_constraints: Stop of failed resource vm_amok is implicit after ha-idg-2 is fenced
The cluster decides to fence the node, although the domain is gone 20 seconds later, at 22:37:31 (see the atop log below).
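One idea for a modification would be to let the stop action wait a little longer for the domain to disappear after the forced shutdown before reporting an error. The Python sketch below only illustrates that polling logic; it is not code from the VirtualDomain RA, and wait_until_gone, is_running and grace_seconds are hypothetical names. In a real RA the probe would have to ask libvirt for the domain state.

```python
import time

def wait_until_gone(is_running, grace_seconds, interval=1.0, sleep=time.sleep):
    """Poll is_running() until it returns False or the grace period is used up.

    Returns True if the domain disappeared within the grace period,
    False if it was still running when the grace period ended.
    """
    waited = 0.0
    while is_running():
        if waited >= grace_seconds:
            return False
        sleep(interval)
        waited += interval
    return True

# Simulated domain that is still visible for three polls, then gone.
# sleep is replaced by a no-op so the example returns immediately.
states = iter([True, True, True, False])
print(wait_until_gone(lambda: next(states), grace_seconds=30, sleep=lambda s: None))
```

Such a wait only helps if it fits inside the operation timeout, because lrmd ends the stop operation when the timeout expires. So the stop timeout would have to be larger than the graceful shutdown period plus the grace period.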
atop log:
14037 - S 261% /usr/bin/qemu-system-x86_64 -machine accel=kvm -name guest=vm_amok,debug-threads=on -S -object secret,id=masterKey0 ...
The PID of the domain is 14037.
14037 - E 0% worker (at 22:37:31)
The domain has stopped.
Bernd
Helmholtz Zentrum München