[ClusterLabs] Antw: Another word of warning regarding VirtualDomain and Live Migration
Roger Zhou
zzhou at suse.com
Wed Dec 16 07:58:57 EST 2020
On 12/16/20 5:06 PM, Ulrich Windl wrote:
> Hi!
>
> (I changed the subject of the thread)
> VirtualDomain seems to be broken, as it does not handle a failed live migration correctly:
>
> With my test-VM running on node h16, this happened when I tried to move it away (for testing):
>
> Dec 16 09:28:46 h19 pacemaker-schedulerd[4427]: notice: * Migrate prm_xen_test-jeos ( h16 -> h19 )
> Dec 16 09:28:46 h19 pacemaker-controld[4428]: notice: Initiating migrate_to operation prm_xen_test-jeos_migrate_to_0 on h16
> Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Transition 840 aborted by operation prm_xen_test-jeos_migrate_to_0 'modify' on h16: Event failed
The RA's migrate_to failed almost immediately. Maybe the configuration is not quite right?
How about enabling tracing and collecting more RA logs, to see exactly which virsh
command is used and to check whether it works manually:
`crm resource trace prm_xen_test-jeos`
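For reference, a rough sketch of the steps (the trace location is an assumption based on a
typical SLES install and may differ on yours):

# trace just the migrate_to action of the resource
crm resource trace prm_xen_test-jeos migrate_to
# retry the migration, e.g.
crm resource move prm_xen_test-jeos h19
# the trace files typically land under /var/lib/heartbeat/trace_ra/VirtualDomain/
# look there for the expanded virsh migrate command line and its exit code
# switch tracing off again when done
crm resource untrace prm_xen_test-jeos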
> Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Transition 840 action 115 (prm_xen_test-jeos_migrate_to_0 on h16): expected 'ok' but got 'error'
> Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: warning: Unexpected result (error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
> Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: warning: Unexpected result (error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
> ### (note the message above is duplicate!)
> Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: error: Resource prm_xen_test-jeos is active on 2 nodes (attempting recovery)
> ### This is nonsense after a failed live migration!
Indeed, sounds like a valid improvement for pacemaker-schedulerd? Or at least the
documentation should articulate what is supposed to happen when migrate_to fails;
I couldn't find that defined in any doc yet.
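My understanding (not taken from any official doc, so treat it as an assumption): after a
failed migrate_to the cluster can no longer tell on which node(s) the domain is actually
running, so it assumes it may be active on both and recovers with a full stop everywhere
plus a fresh start. You can replay what the scheduler decided from the input file it saved
(e.g. the pe-error-4.bz2 mentioned in your earlier mail), roughly like this:

# replay the saved scheduler input and show the resulting transition
crm_simulate -S -x /var/lib/pacemaker/pengine/pe-error-4.bz2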
> Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]: notice: * Recover prm_xen_test-jeos ( h19 )
>
>
> So the cluster is doing exactly the wrong thing: The VM is still active on h16, while a "recovery" on h19 will start it there! So _after_ the recovery the VM is running twice.
>
> Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Initiating stop operation prm_xen_test-jeos_stop_0 locally on h19
> Dec 16 09:28:47 h19 VirtualDomain(prm_xen_test-jeos)[20656]: INFO: Domain test-jeos already stopped.
> Dec 16 09:28:47 h19 pacemaker-execd[4425]: notice: prm_xen_test-jeos stop (call 372, PID 20620) exited with status 0 (execution time 283ms, queue time 0ms)
> Dec 16 09:28:47 h19 pacemaker-controld[4428]: notice: Result of stop operation for prm_xen_test-jeos on h19: ok
> Dec 16 09:31:45 h19 pacemaker-controld[4428]: notice: Initiating start operation prm_xen_test-jeos_start_0 locally on h19
>
> Dec 16 09:31:47 h19 pacemaker-execd[4425]: notice: prm_xen_test-jeos start (call 373, PID 21005) exited with status 0 (execution time 2715ms, queue time 0ms)
> Dec 16 09:31:47 h19 pacemaker-controld[4428]: notice: Result of start operation for prm_xen_test-jeos on h19: ok
> Dec 16 09:33:46 h19 pacemaker-schedulerd[4427]: warning: Unexpected result (error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
>
Yeah, schedulerd is trying very hard to keep reporting that migrate_to failure here!
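Once you are sure where the domain really runs (and have cleaned up any stray copy), the
stale failure record can be cleared so that it is not reported over and over; for example:

# clear the recorded migrate_to failure for this resource
crm resource cleanup prm_xen_test-jeos
# or with the lower-level tool
crm_resource --cleanup --resource prm_xen_test-jeos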
> Amazingly manual migration using virsh worked:
> virsh migrate --live test-jeos xen+tls://h18...
>
What about s/h18/h19/?
Or, reproduce manually exactly what the RA code runs:
`virsh ${VIRSH_OPTIONS} migrate --live $migrate_opts $DOMAIN_NAME $remoteuri $migrateuri`
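To see what those variables expand to in your setup, the configured resource parameters
are a good starting point (a sketch; the trace output mentioned above would show the fully
expanded command line):

# show the configured parameters (hypervisor, migration_transport, migrateuri, ...)
crm configure show prm_xen_test-jeos
# and test connectivity to the target hypervisor with the same transport, e.g.
virsh -c xen+tls://h19/system list    # adjust the URI to whatever the RA actually builds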
Good luck!
Roger
> Regards,
> Ulrich Windl
>
>
>>>> Ulrich Windl wrote on 14.12.2020 at 15:21 in message <5FD774CF.8DE : 161 : 60728>:
>> Hi!
>>
>> I think I found the reason why a VM is started on two nodes:
>>
>> Live migration had failed (e.g. away from h16), so the cluster uses stop and
>> start (stop on h16, start on h19 for example).
>> When rebooting h16, I see these messages (h19 is DC):
>>
>> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: warning: Unexpected result
>> (error: test-jeos: live migration to h16 failed: 1) was recorded for
>> migrate_to of prm_xen_test-jeos on h19 at Dec 14 11:54:08 2020
>> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: error: Resource
>> prm_xen_test-jeos is active on 2 nodes (attempting recovery)
>>
>> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: notice: * Restart
>> prm_xen_test-jeos ( h16 )
>>
>> THIS IS WRONG: h16 was just rebooted, so no VM is running on h16 (unless there was
>> some autostart from libvirt; "virsh list --autostart" does not list any).
>>
>> Dec 14 15:09:27 h16 VirtualDomain(prm_xen_test-jeos)[4850]: INFO: Domain
>> test-jeos already stopped.
>>
>> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]: error: Calculated
>> transition 669 (with errors), saving inputs in
>> /var/lib/pacemaker/pengine/pe-error-4.bz2
>>
>> What's going on here?
>>
>> Regards,
>> Ulrich
>>
>>>>> Ulrich Windl wrote on 14.12.2020 at 08:15 in message <5FD7110D.D09 : 161 : 60728>:
>>> Hi!
>>>
>>> Another word of warning regarding VirtualDomain: While configuring a 3-node
>>> cluster with SLES15 SP2 for Xen PVM (using libvirt and the VirtualDomain RA),
>>> I had created a test VM using BtrFS.
>>> At some point during testing the cluster ended up with the test VM running on more
>>> than one node (for reasons still to be examined). Only after a "crm resource
>>> refresh" (reprobe) did the cluster try to fix the problem.
>>> Well, at some point the VM wouldn't start any more, because the BtrFS used
>>> for everything (the SLES default) was corrupted in a way that seems unrecoverable,
>>> independently of how many subvolumes and snapshots of those may exist.
>>>
>>> Initially I would guess the libvirt stack and VirtualDomain are less reliable
>>> than the old Xen method and RA.
>>>
>>> Regards,
>>> Ulrich
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>