[ClusterLabs] Antw: [EXT] Re: VirtualDomain does not stop via "crm resource stop" - modify RA ?

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Oct 26 03:41:10 EDT 2020


>>> "Lentes, Bernd" <bernd.lentes at helmholtz-muenchen.de> wrote on 23.10.2020 at 23:16 in message
<1814448122.1773393.1603487817751.JavaMail.zimbra at helmholtz-muenchen.de>:

> 
> ----- On Oct 23, 2020, at 8:45 PM, Valentin Vidić vvidic at valentin-vidic.from.hr wrote:
> 
>> On Fri, Oct 23, 2020 at 08:08:31PM +0200, Lentes, Bernd wrote:
>>> But when the timeout has run out, the RA tries to kill the machine with a
>>> "virsh destroy".
>>> And if that does not work (which is occasionally my problem) because the
>>> domain is in uninterruptible sleep (D state), the RA returns $OCF_ERR_GENERIC,
>>> which causes pacemaker to fence the lazy node. Or am I wrong?
>> 
>> What does the log look like when this happens?
>> 
> 
> /var/log/cluster/corosync.log:
> 
> VirtualDomain(vm_amok)[8998]:   2020/09/27_22:34:11 INFO: Issuing graceful shutdown request for domain vm_amok.
> 
> VirtualDomain(vm_amok)[8998]:   2020/09/27_22:37:06 INFO: Issuing forced shutdown (destroy) request for domain vm_amok.
> Sep 27 22:37:11 [11282] ha-idg-2       lrmd:  warning: child_timeout_callback:  vm_amok_stop_0 process (PID 8998) timed out
> Sep 27 22:37:11 [11282] ha-idg-2       lrmd:  warning: operation_finished:  vm_amok_stop_0:8998 - timed out after 180000ms
>   timeout of the domain is 180 sec.
> 
> /var/log/libvirt/libvirtd.log (time is UTC):
> 
> 2020-09-27 20:37:21.489+0000: 18583: error : virProcessKillPainfully:401 : Failed to terminate process 14037 with SIGKILL: Device or resource busy

"SIGKILL: Device or resource busy" is nonsense: kill does not wait; it either
fails or succeeds.
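
To illustrate the distinction: kill(2) only delivers (or queues) the signal and returns at once; whether the target has actually gone has to be checked separately. A process in D state does not act on SIGKILL until it leaves the uninterruptible sleep, so a caller that waits a fixed time after the kill can run into its own timeout. The following is only a minimal sketch of such a "kill, then wait" loop, not the libvirt implementation; the helper name kill_and_wait and the timeout values are made up for illustration.

```python
import os
import signal
import subprocess
import sys
import time


def kill_and_wait(proc, timeout=5.0, poll_interval=0.1):
    """Send SIGKILL, then poll until the process is gone or the timeout expires.

    Returns True if the process exited within the timeout, False otherwise.
    (Hypothetical helper for illustration only.)
    """
    # kill(2) returns immediately: it succeeds (signal sent) or raises OSError.
    os.kill(proc.pid, signal.SIGKILL)

    # The actual termination happens asynchronously, so it is polled here.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return True
        time.sleep(poll_interval)
    return proc.poll() is not None


if __name__ == "__main__":
    # Stand-in for a long-running process; an ordinary sleeper dies at once.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exited:", kill_and_wait(child), "returncode:", child.returncode)
```

With an ordinary sleeping child the loop returns True almost immediately; a process stuck in D state would make it return False even though the kill call itself succeeded.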

> 2020-09-27 20:37:21.505+0000: 6610: error : virNetSocketWriteWire:1852 : Cannot write data: Broken pipe
> 2020-09-27 20:37:31.962+0000: 6610: error : qemuMonitorIO:719 : internal error: End of file from qemu monitor
> 
> SIGKILL didn't work. Nevertheless the process finished 20 seconds after the
> destroy, surely because it woke up from D state and received the signal.
> 
> /var/log/cluster/corosync.log on the DC:
> 
> Sep 27 22:37:11 [3580] ha-idg-1       crmd:  warning: status_from_rc:   Action 93 (vm_amok_stop_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error
>   Stop (also sigkill) failed
> Sep 27 22:37:11 [3579] ha-idg-1    pengine:   notice: native_stop_constraints:  Stop of failed resource vm_amok is implicit after ha-idg-2 is fenced
>   cluster decides to fence the node although resource is stopped 10 seconds later
> 
> atop log:
> 14037      - S 261% /usr/bin/qemu-system-x86_64 -machine accel=kvm -name guest=vm_amok,debug-threads=on -S -object secret,id=masterKey0 ...
>   PID of the domain is 14037
> 
> 14037      - E   0% worker   (at 22:37:31)
>   domain has stopped
> 
> 
> Bernd
> Helmholtz Zentrum München
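
The timestamps quoted above come from two clocks (corosync and atop in local time, libvirtd in UTC), which makes the order of events hard to read. The following is a rough sketch that puts them on one timeline. It assumes the nodes log in CEST (UTC+2), which is inferred from 22:37 local against 20:37 UTC and not stated in the logs; fractions of a second are dropped, and the names EVENTS and offsets are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time on the nodes is CEST (UTC+2), inferred from the logs.
LOCAL = timezone(timedelta(hours=2))
FMT = "%Y-%m-%d %H:%M:%S"


def local(ts):
    """Parse a corosync/atop timestamp (local time)."""
    return datetime.strptime(ts, FMT).replace(tzinfo=LOCAL)


def utc(ts):
    """Parse a libvirtd timestamp (UTC)."""
    return datetime.strptime(ts, FMT).replace(tzinfo=timezone.utc)


# Events taken from the quoted logs, sub-second parts dropped.
EVENTS = [
    ("graceful shutdown requested", local("2020-09-27 22:34:11")),
    ("forced shutdown (destroy) requested", local("2020-09-27 22:37:06")),
    ("stop operation timed out (180 s)", local("2020-09-27 22:37:11")),
    ("libvirt reports SIGKILL failed", utc("2020-09-27 20:37:21")),
    ("qemu monitor EOF, atop shows exit", utc("2020-09-27 20:37:31")),
]


def offsets(events):
    """Return (name, seconds since the first event) for each event."""
    start = events[0][1]
    return [(name, int((when - start).total_seconds())) for name, when in events]


if __name__ == "__main__":
    for name, secs in offsets(EVENTS):
        print(f"+{secs:4d}s  {name}")
```

Under that assumption the process exits about 20 seconds after the stop operation timed out, and about 10 seconds after libvirt reported the failed SIGKILL.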
> 




