[ClusterLabs] Re: Another word of warning regarding VirtualDomain and Live Migration
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Wed Dec 16 04:06:24 EST 2020
Hi!
(I changed the subject of the thread)
VirtualDomain seems to be broken, as it does not handle a failed live migration correctly:
With my test-VM running on node h16, this happened when I tried to move it away (for testing):
Dec 16 09:28:46 h19 pacemaker-schedulerd[4427]: notice: * Migrate prm_xen_test-jeos ( h16 -> h19 )
Dec 16 09:28:46 h19 pacemaker-controld[4428]: notice: Initiating migrate_to operation prm_xen_test-jeos_migrate_to_0 on h16
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Transition 840 aborted by operation prm_xen_test-jeos_migrate_to_0 'modify' on h16: Event failed
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Transition 840 action 115 (prm_xen_test-jeos_migrate_to_0 on h16): expected 'ok' but got 'error'
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: warning: Unexpected result (error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: warning: Unexpected result (error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
### (note that the message above is duplicated!)
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: error: Resource prm_xen_test-jeos is active on 2 nodes (attempting recovery)
### This is nonsense after a failed live migration!
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: notice: * Recover prm_xen_test-jeos ( h19 )
So the cluster is doing exactly the wrong thing: The VM is still active on h16, while a "recovery" on h19 will start it there! So _after_ the recovery the VM runs twice.
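For context, the failing step in the RA's migrate_to looks roughly like this (a simplified sketch reconstructed from the logged exit reason, not the literal RA source; the variable names are illustrative):

# Simplified sketch of VirtualDomain's migrate_to failure path:
virsh $VIRSH_OPTIONS migrate --live $DOMAIN_NAME $remoteuri $migrateuri
rc=$?
if [ $rc -ne 0 ]; then
    ocf_exit_reason "$DOMAIN_NAME: live migration to $target_node failed: $rc"
    return $OCF_ERR_GENERIC  # the domain usually keeps running on the source node
fi

So the RA only reports a generic error, and the scheduler apparently concludes that the resource state is unknown on both nodes instead of assuming it stayed on the source.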
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Initiating stop operation prm_xen_test-jeos_stop_0 locally on h19
Dec 16 09:28:47 h19 VirtualDomain(prm_xen_test-jeos)[20656]: INFO: Domain test-jeos already stopped.
Dec 16 09:28:47 h19 pacemaker-execd[4425]: notice: prm_xen_test-jeos stop (call 372, PID 20620) exited with status 0 (execution time 283ms, queue time 0ms)
Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Result of stop operation for prm_xen_test-jeos on h19: ok
Dec 16 09:31:45 h19 pacemaker-controld[4428]: notice: Initiating start operation prm_xen_test-jeos_start_0 locally on h19
Dec 16 09:31:47 h19 pacemaker-execd[4425]: notice: prm_xen_test-jeos start (call 373, PID 21005) exited with status 0 (execution time 2715ms, queue time 0ms)
Dec 16 09:31:47 h19 pacemaker-controld[4428]: notice: Result of start operation for prm_xen_test-jeos on h19: ok
Dec 16 09:33:46 h19 pacemaker-schedulerd[4427]: warning: Unexpected result (error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
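To see where the domain really ends up after such a failed migration, one can ask libvirt on both nodes directly (hostnames and domain name taken from the logs above):

# Check on which node(s) the domain is actually running:
for h in h16 h19; do
    echo "== $h =="
    ssh root@$h 'virsh list --all | grep test-jeos'
done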
Amazingly, a manual migration using virsh worked:
virsh migrate --live test-jeos xen+tls://h18...
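(The RA presumably builds a similar URI from its migration_transport parameter. For reference, a VirtualDomain primitive with live migration enabled looks roughly like the following sketch; the config path and timeout values are illustrative, while migration_transport and allow-migrate are the documented options:)

# Sketch (crm shell syntax) of a VirtualDomain primitive with live migration:
crm configure primitive prm_xen_test-jeos ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/libxl/test-jeos.xml" \
        hypervisor="xen:///system" migration_transport="tls" \
    op monitor interval=30s timeout=60s \
    op migrate_to timeout=300s interval=0 \
    meta allow-migrate=true

With allow-migrate=true the cluster issues migrate_to/migrate_from instead of stop/start, which is exactly the code path that misbehaves here.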
Regards,
Ulrich Windl
>>> Ulrich Windl wrote on 14.12.2020 at 15:21 in message <5FD774CF.8DE:161:60728>:
> Hi!
>
> I think I found the reason why a VM is started on two nodes:
>
> Live migration had failed (e.g. away from h16), so the cluster used stop and
> start instead (stop on h16, start on h19, for example).
> When rebooting h16, I see these messages (h19 is DC):
>
> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: warning: Unexpected result
> (error: test-jeos: live migration to h16 failed: 1) was recorded for
> migrate_to of prm_xen_test-jeos on h19 at Dec 14 11:54:08 2020
> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: error: Resource
> prm_xen_test-jeos is active on 2 nodes (attempting recovery)
>
> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: notice: * Restart
> prm_xen_test-jeos ( h16 )
>
> THIS IS WRONG: h16 was just rebooted, so no VM is running on h16 (unless there
> was some autostart from libvirt; "virsh list --autostart" does not list any).
>
> Dec 14 15:09:27 h16 VirtualDomain(prm_xen_test-jeos)[4850]: INFO: Domain
> test-jeos already stopped.
>
> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: error: Calculated
> transition 669 (with errors), saving inputs in
> /var/lib/pacemaker/pengine/pe-error-4.bz2
>
> What's going on here?
>
> Regards,
> Ulrich
>
> >>> Ulrich Windl wrote on 14.12.2020 at 08:15 in message <5FD7110D.D09:161:60728>:
> > Hi!
> >
> > Another word of warning regarding VirtualDomain: While configuring a 3-node
> > cluster with SLES15 SP2 for Xen PVM (using libvirt and the VirtualDomain RA),
> > I had created a test VM using BtrFS.
> > At some point during testing the cluster ended up with the test VM running on
> > more than one node (for reasons still to be examined). Only after a "crm
> > resource refresh" (reprobe) did the cluster try to fix the problem.
> > Well, at some point the VM wouldn't start any more, because the BtrFS used
> > for everything (the SLES default) was corrupted in a way that seems
> > unrecoverable, independently of how many subvolumes and snapshots of those
> > may exist.
> >
> > Initially I would guess that the libvirt stack and the VirtualDomain RA are
> > less reliable than the old Xen method and RA.
> >
> > Regards,
> > Ulrich