[ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

Lentes, Bernd bernd.lentes at helmholtz-muenchen.de
Fri Oct 23 17:16:57 EDT 2020



----- On Oct 23, 2020, at 8:45 PM, Valentin Vidić vvidic at valentin-vidic.from.hr wrote:

> On Fri, Oct 23, 2020 at 08:08:31PM +0200, Lentes, Bernd wrote:
>> But when the timeout has run out, the RA tries to kill the machine with a "virsh
>> destroy".
>> And if that does not work (which is occasionally my problem) because the domain
>> is in uninterruptible sleep (D state), the RA returns $OCF_ERR_GENERIC, which
>> causes pacemaker to fence the lazy node. Or am I wrong?
> 
> What does the log look like when this happens?
> 

/var/log/cluster/corosync.log:

VirtualDomain(vm_amok)[8998]:   2020/09/27_22:34:11 INFO: Issuing graceful shutdown request for domain vm_amok.

VirtualDomain(vm_amok)[8998]:   2020/09/27_22:37:06 INFO: Issuing forced shutdown (destroy) request for domain vm_amok.
Sep 27 22:37:11 [11282] ha-idg-2       lrmd:  warning: child_timeout_callback:  vm_amok_stop_0 process (PID 8998) timed out
Sep 27 22:37:11 [11282] ha-idg-2       lrmd:  warning: operation_finished:      vm_amok_stop_0:8998 - timed out after 180000ms
  The stop timeout of the domain is 180 seconds.

/var/log/libvirt/libvirtd.log (time is UTC):

2020-09-27 20:37:21.489+0000: 18583: error : virProcessKillPainfully:401 : Failed to terminate process 14037 with SIGKILL: Device or resource busy
2020-09-27 20:37:21.505+0000: 6610: error : virNetSocketWriteWire:1852 : Cannot write data: Broken pipe
2020-09-27 20:37:31.962+0000: 6610: error : qemuMonitorIO:719 : internal error: End of file from qemu monitor

SIGKILL didn't work at first. Nevertheless, the process had ended about 20 seconds after the destroy, presumably because it woke up from D state and then received the signal.
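
To make it easier to compare the two logs, the timestamps above can be put on one timeline. This is only a rough sketch in Python: the UTC+2 offset is an assumption (it matches the two-hour difference between the corosync and libvirtd entries), fractional seconds are dropped, and the event labels are my own wording.

```python
from datetime import datetime, timedelta

# Assumption: local time is UTC+2 (CEST), inferred from the two-hour gap
# between the corosync entries and the libvirtd entries.
LOCAL_OFFSET = timedelta(hours=2)


def local(ts):
    """Parse a timestamp that is already in local time."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


def utc_to_local(ts):
    """Parse a UTC timestamp (libvirtd.log) and shift it to local time."""
    return local(ts) + LOCAL_OFFSET


# Timestamps copied from the log excerpts, fractional seconds dropped.
events = [
    ("graceful shutdown requested", local("2020-09-27 22:34:11")),
    ("forced shutdown (destroy) requested", local("2020-09-27 22:37:06")),
    ("stop operation timed out", local("2020-09-27 22:37:11")),
    ("SIGKILL reported as failed", utc_to_local("2020-09-27 20:37:21")),
    ("end of file from qemu monitor", utc_to_local("2020-09-27 20:37:31")),
]


def offsets(events):
    """Return (label, whole seconds since the first event) pairs."""
    start = events[0][1]
    return [(name, int((when - start).total_seconds())) for name, when in events]


for name, seconds in offsets(events):
    print(f"+{seconds:>3d} s  {name}")
```

Under these assumptions the destroy request is issued only 5 seconds before the 180-second stop timeout expires, and the qemu monitor entry appears roughly 20 seconds after that timeout.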

/var/log/cluster/corosync.log on the DC:

Sep 27 22:37:11 [3580] ha-idg-1       crmd:  warning: status_from_rc:   Action 93 (vm_amok_stop_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error
  The stop (including SIGKILL) failed.
Sep 27 22:37:11 [3579] ha-idg-1    pengine:   notice: native_stop_constraints:  Stop of failed resource vm_amok is implicit after ha-idg-2 is fenced
  The cluster decides to fence the node, although the resource is stopped 10 seconds later.

atop log:
14037      - S 261% /usr/bin/qemu-system-x86_64 -machine accel=kvm -name guest=vm_amok,debug-threads=on -S -object secret,id=masterKey0 ...
  PID of the domain is 14037

14037      - E   0% worker   (at 22:37:31)
  The domain has stopped.
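
What I have in mind for a modified RA could be sketched like this. It is not the VirtualDomain code, only a Python model of the idea: after the destroy request, keep polling for a while instead of returning an error immediately. All function and parameter names (stop_domain, kill_grace, is_running, ...) are hypothetical; only the two OCF return code values are the standard ones.

```python
import time

OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1


def stop_domain(is_running, shutdown, destroy, graceful_timeout, kill_grace,
                poll=1.0, clock=time.monotonic, sleep=time.sleep):
    """Hypothetical stop sequence: graceful shutdown, destroy, then wait."""

    def wait_until_gone(limit):
        # Poll until the domain is gone or `limit` seconds have passed.
        deadline = clock() + limit
        while is_running():
            if clock() >= deadline:
                return False
            sleep(poll)
        return True

    if not is_running():
        return OCF_SUCCESS

    shutdown()
    if wait_until_gone(graceful_timeout):
        return OCF_SUCCESS

    # The destroy may not take effect at once if the process is in D state.
    destroy()
    # Assumption: give the process extra time to wake up and die
    # instead of reporting an error right away.
    if wait_until_gone(kill_grace):
        return OCF_SUCCESS
    return OCF_ERR_GENERIC
```

In such a scheme graceful_timeout plus kill_grace would have to stay below the stop timeout configured in the cluster, otherwise the operation times out before the wait can finish, as in the log above.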


Bernd
Helmholtz Zentrum München
