[ClusterLabs] Antw: [EXT] Re: Q: Prevent non‑live VM migration

Tue Jul 13 04:31:14 EDT 2021

>>> <kgaillot at redhat.com> wrote on 12.07.2021 at 16:53 in message
<376475adc8217a97adf9374d0aad0317eabd5f90.camel at redhat.com>:
> On Mon, 2021‑07‑12 at 08:35 +0200, Ulrich Windl wrote:
>> Hi!
>> 
>> We had some problem in the cluster that prevented live migration of
>> VMs. As a consequence the cluster migrated the VMs using stop/start.

For completeness: it turned out that the libvirt install script removes the
"--listen" parameter that is required when starting libvirt from the cluster:
# The '--listen' option is incompatible with socket activation.
# We need to forcibly remove it from /etc/sysconfig/libvirtd.
# Also add the --timeout option to be consistent with upstream.
# See boo#1156161 for details
sed -i -e '/^\s*LIBVIRTD_ARGS=/s/--listen//g' /etc/sysconfig/libvirtd
if ! grep -q -E '^\s*LIBVIRTD_ARGS=.*--timeout' /etc/sysconfig/libvirtd ; then
    sed -i 's/^\s*LIBVIRTD_ARGS="\(.*\)"/LIBVIRTD_ARGS="\1 --timeout 120"/' /etc/sysconfig/libvirtd
fi
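
To undo that after a package update, something like the following could be used. This is only a sketch: ensure_listen is a made-up helper name, it assumes GNU sed and a quoted LIBVIRTD_ARGS="..." line, and it does nothing if that line is missing from the file.

```shell
#!/bin/sh
# ensure_listen FILE
# Hypothetical helper: put "--listen" back into LIBVIRTD_ARGS if it is missing.
# Assumes GNU sed (-i) and a line of the form LIBVIRTD_ARGS="...".
ensure_listen() {
    conf=$1
    # Nothing to do if --listen is already part of LIBVIRTD_ARGS.
    if grep -q -E '^[[:space:]]*LIBVIRTD_ARGS=.*--listen' "$conf"; then
        return 0
    fi
    # Insert "--listen " right after the opening quote of the value.
    sed -i -e 's/^\([[:space:]]*LIBVIRTD_ARGS="\)/\1--listen /' "$conf"
}
```

It would be called as "ensure_listen /etc/sysconfig/libvirtd", followed by a restart of libvirtd.service. Calling it twice does not add the option twice.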

Another set of tricks:
So adding "--listen" again and restarting libvirtd.service almost fixed the
problem:
While it's OK to restart libvirtd while VMs are running, there were stale
locks (virtlockd) that were tricky to clean up.

More specifically, messages like these aren't really helpful (what the heck
does that lock refer to?):
Jul 13 10:03:11 h16 virtlockd[8935]: resource busy: Lockspace resource
'56c8f9a7a41ce0ffaa53061ec08689fb8035ef3dbf560723103993b2dff4a15d' is locked
Jul 13 10:03:11 h16 libvirtd[22972]: resource busy: Lockspace resource
'56c8f9a7a41ce0ffaa53061ec08689fb8035ef3dbf560723103993b2dff4a15d' is locked

Even if you find that "lock" in
/var/lib/libvirt/lockd/files/56c8f9a7a41ce0ffaa53061ec08689fb8035ef3dbf560723103993b2dff4a15,
you are none the wiser, I'm afraid ;-)
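
One possible way to map such a name back to a disk, under an assumption I have not verified on this system: the libvirt documentation describes the lock file name in the lockd file lockspace as the SHA256 checksum of the fully qualified path of the disk. If that holds here, hashing candidate paths and comparing should find the owner. find_lock_path is a made-up helper name.

```shell
#!/bin/sh
# find_lock_path LOCKNAME PATH...
# Hypothetical helper: print every candidate path whose SHA-256 checksum
# (of the path string itself, not of the file contents) equals LOCKNAME.
find_lock_path() {
    lock=$1
    shift
    for path in "$@"; do
        sum=$(printf '%s' "$path" | sha256sum | cut -d' ' -f1)
        if [ "$sum" = "$lock" ]; then
            printf '%s\n' "$path"
        fi
    done
}
```

The candidates would be the disk source paths of the VMs, for example those listed by "virsh domblklist" for each domain. No output means that none of the candidates matches, or that the assumption about the naming is wrong.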

BTW: I had filed an enhancement request regarding that some time ago...

Regards,
Ulrich

>> I wonder: Is there a way to prevent stop/start migration if live‑
>> migration failed?
> 
> The only thing I can think of is setting on‑fail=block for migrate_to
> and migrate_from actions. I'd be cautious though; if the migration
> fails in a way that leaves the domain inaccessible, it will stay that
> way.
> 
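
For the record, a sketch of how that suggestion might be applied with pcs. block_on_failed_migration is a made-up helper name, and by default it only prints the commands instead of running them. As far as I know, "pcs resource update" resets operation options that are not named on the command line to their defaults, so an existing timeout would have to be repeated there. With crmsh the equivalent would be adding on-fail=block to the "op migrate_to" and "op migrate_from" lines of the primitive.

```shell
#!/bin/sh
# block_on_failed_migration RESOURCE
# Hypothetical helper: set on-fail=block for the migrate_to and migrate_from
# operations of RESOURCE. With DRY_RUN=1 (the default) the commands are only
# printed; set DRY_RUN=0 to actually run them.
block_on_failed_migration() {
    rsc=$1
    for action in migrate_to migrate_from; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "pcs resource update $rsc op $action on-fail=block"
        else
            pcs resource update "$rsc" op "$action" on-fail=block
        fi
    done
}
```

As Ken warns, a resource blocked this way stays where it is until someone cleans it up manually, even if the failed migration left the domain inaccessible.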
>> In our case the migration was triggered by the resource placement
>> strategy.
>> 
>> The messages logged would look like this:
>>  warning: Unexpected result (error: v15: live migration to h18
>> failed: 1) was recorded for migrate_to of prm_xen_v15 on h16
>> 
>> Regards,
>> Ulrich
>> 
>> 
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> ClusterLabs home: https://www.clusterlabs.org/ 
>> 
> ‑‑ 
> Ken Gaillot <kgaillot at redhat.com>
> 