[ClusterLabs] Antw: [EXT] Re: failed migration handled the wrong way

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Feb 1 05:59:16 EST 2021


>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 01.02.2021 at 11:05 in
message
<CAA91j0V-4YzNfT-KJ1nzLE_UyEdNOoiBtUMFjaST4O8L+uX8aQ at mail.gmail.com>:
> On Mon, Feb 1, 2021 at 12:53 PM Ulrich Windl
> <Ulrich.Windl at rz.uni-regensburg.de> wrote:
...
>> Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice: prm_xen_test-jeos5_stop_0[33137] error output [ error: internal error: Failed to shutdown domain '13' with libxenlight ]
>> Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice: prm_xen_test-jeos5_stop_0[33137] error output [  ]
>> Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice: prm_xen_test-jeos5 stop (call 230, PID 33137) exited with status 0 (execution time 177112ms, queue time 0ms)
>>
>> ### Shouldn't the result be error?
>>
> 
> If domain remained active, I would say yes. But do not forget that
> failure to stop resources by default will kill the node.

In fact, "virsh list" still listed the domain, but the cluster had destroyed
the image (once again).
Trying a "restart" of the VM actually resulted in the node being fenced.
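
To make the expectation concrete, here is a minimal Python sketch (not the
VirtualDomain agent's actual code) of the check one would expect after a
forced stop: if the domain still appears in the output of "virsh list --name",
the stop should report an error rather than success. The function names and
the domain name "test-jeos5" are assumptions for illustration only;
OCF_SUCCESS (0) and OCF_ERR_GENERIC (1) are the standard OCF return codes.

```python
# Standard OCF resource agent return codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1


def domain_still_listed(virsh_list_output: str, domain: str) -> bool:
    """Return True if `domain` appears in `virsh list --name` output."""
    names = {line.strip() for line in virsh_list_output.splitlines()}
    return domain in names


def expected_stop_result(virsh_list_output: str, domain: str) -> int:
    """Exit code a stop action should report, given what is still running."""
    if domain_still_listed(virsh_list_output, domain):
        return OCF_ERR_GENERIC  # domain survived the stop: report failure
    return OCF_SUCCESS


if __name__ == "__main__":
    # Hypothetical captured output; the domain name is an assumption.
    still_running = "Domain-0\ntest-jeos5\n"
    print(expected_stop_result(still_running, "test-jeos5"))
```

With such a check, the stop in the log above would have ended with a non-zero
status instead of 0.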

> 
>> Fortunately locking prevented duplicate activation of h18:
>> Feb 01 10:32:51 h18 systemd[1]: Started Virtualization daemon.
>> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/files already exists
>> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/lvmvolumes already exists
>> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/scsivolumes already exists
>>
>> So the main issue seems to be that a failed forced stop returned "success",
>> causing a "recover" on h18 while the VM still runs on h16.
> 
> No, "recover" was caused by failure to migrate. You told pacemaker
> that you now want this VM on another host, and your wish was its
> command - it attempted to fulfill it. It obviously needed to stop VM
> on its current host before trying to (re-)start on a new home.

But the VM *wasn't* stopped on h16!
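
For anyone wanting to look for the same pattern in their own logs, here is a
rough Python sketch that flags stop operations which logged non-empty error
output but still exited with status 0. The regular expressions are assumptions
derived only from the pacemaker-execd lines quoted in this thread (with each
message on a single line); other Pacemaker versions may format these messages
differently.

```python
import re

# Patterns inferred from the log lines in this thread (an assumption,
# not a documented Pacemaker log format).
ERR_RE = re.compile(r"(\S+)_stop_0\[(\d+)\] error output \[ (.*?) \]\s*$")
EXIT_RE = re.compile(
    r"(\S+) stop \(call \d+, PID (\d+)\) exited with status (\d+)"
)

# The three h16 messages from this thread, one message per line.
SAMPLE_LOG = [
    "Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice: "
    "prm_xen_test-jeos5_stop_0[33137] error output [ error: internal error: "
    "Failed to shutdown domain '13' with libxenlight ]",
    "Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice: "
    "prm_xen_test-jeos5_stop_0[33137] error output [  ]",
    "Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice: "
    "prm_xen_test-jeos5 stop (call 230, PID 33137) exited with status 0 "
    "(execution time 177112ms, queue time 0ms)",
]


def suspicious_stops(lines):
    """Return (resource, pid, messages) for stops that exited 0 despite errors."""
    errors = {}  # PID -> list of non-empty error messages seen so far
    hits = []
    for line in lines:
        m = ERR_RE.search(line)
        if m:
            message = m.group(3).strip()
            if message:  # skip the empty "error output [  ]" lines
                errors.setdefault(m.group(2), []).append(message)
            continue
        m = EXIT_RE.search(line)
        if m and m.group(3) == "0" and m.group(2) in errors:
            hits.append((m.group(1), m.group(2), "; ".join(errors[m.group(2)])))
    return hits


if __name__ == "__main__":
    for resource, pid, messages in suspicious_stops(SAMPLE_LOG):
        print(f"{resource} (PID {pid}) exited 0 but logged: {messages}")
```

Matching on the PID keeps the error output tied to the operation that
produced it, so unrelated stops that exit cleanly are not flagged.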

> 
>>
>> h16:~ # rpm -qf /usr/lib/ocf/resource.d/heartbeat/VirtualDomain
>> resource-agents-4.4.0+git57.70549516-3.12.1.x86_64
>>
>> (SLES15 SP2)
>>
>> Regards,
>> Ulrich
>>
>>
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>
>> ClusterLabs home: https://www.clusterlabs.org/ 




