[ClusterLabs] Antw: [EXT] Re: Antw: Another word of warning regarding VirtualDomain and Live Migration

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Wed Dec 16 08:53:43 EST 2020


>>> Roger Zhou <zzhou at suse.com> wrote on 16.12.2020 at 13:58 in message
<8ab80ef4-462c-421b-09b8-084d270d4175 at suse.com>:

> On 12/16/20 5:06 PM, Ulrich Windl wrote:
>> Hi!
>> 
>> (I changed the subject of the thread)
>> VirtualDomain seems to be broken, as it does not handle a failed 
> live-migration correctly:
>> 
>> With my test-VM running on node h16, this happened when I tried to move it 
> away (for testing):
>> 
>> Dec 16 09:28:46 h19 pacemaker-schedulerd[4427]:  notice:  * Migrate    
> prm_xen_test-jeos                    ( h16 -> h19 )
>> Dec 16 09:28:46 h19 pacemaker-controld[4428]:  notice: Initiating migrate_to 
> operation prm_xen_test-jeos_migrate_to_0 on h16
>> Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Transition 840 
> aborted by operation prm_xen_test-jeos_migrate_to_0 'modify' on h16: Event 
> failed
> 
> The RA's migrate_to failed quickly. Maybe the configuration is not quite 
> right?

Probably you are right, but shouldn't the reason for the failure play a part in the decision on how to handle it? Specifically in the case when the node was just booted and the cluster claims a VM is running there.
There is no real problem with the cluster *believing* one VM is running on two nodes UNTIL pacemaker decides to "recover" from it. Only THEN is one VM actually running on two nodes.
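
To illustrate: the cheap way to see what is really running before pacemaker "recovers" anything is to compare crm_mon's view with libvirt's view on every node, e.g. (just an ad-hoc sketch; node names as in this thread):

crm_mon -1 | grep prm_xen_test-jeos
for n in h16 h18 h19; do
        echo "== $n =="
        ssh root@$n "virsh --connect=xen:///system list --all"
done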

For reference, I had configured "xen+tls" and I can live-migrate the VMs manually using "virsh". The RA config basically is:

primitive prm_xen_test-jeos VirtualDomain \
        params config="/etc/libvirt/libxl/test-jeos.xml" hypervisor="xen:///system" \
               autoset_utilization_cpu=false autoset_utilization_hv_memory=false \
        op start timeout=120 interval=0 \
        op stop timeout=180 interval=0 \
        op monitor interval=600 timeout=90 \
        op migrate_to timeout=300 interval=0 \
        op migrate_from timeout=300 interval=0 \
        utilization utl_cpu=20 utl_ram=2048 \
        meta priority=123 allow-migrate=true
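
For completeness, the manual migration that does work here looks roughly like this (hostname left as a placeholder; the FQHN matching the certificate subject has to be used, see below):

virsh --connect=xen:///system migrate --live test-jeos xen+tls://<FQHN of h18>/system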

> 
> How about enabling trace and collecting more RA logs, to check exactly which 
> virsh command is used and whether it works manually:
> 
> `crm resource trace prm_xen_test-jeos`

BTW:
h16:~ # crm resource trace prm_xen_test-jeos
INFO: Trace for prm_xen_test-jeos is written to /var/lib/heartbeat/trace_ra/
INFO: Trace set, restart prm_xen_test-jeos to trace non-monitor operations
h16:~ # ll /var/lib/heartbeat/trace_ra/
ls: cannot access '/var/lib/heartbeat/trace_ra/': No such file or directory
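
Presumably that directory only gets created once a traced non-monitor operation actually runs; after the next migrate_to attempt I would expect something like this to show up (the exact layout is a guess on my part):

h16:~ # ls -lt /var/lib/heartbeat/trace_ra/VirtualDomain/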


h16:~ # virsh list
 Id   Name        State
---------------------------
 0    Domain-0    running
 4    test-jeos   running
h16:~ # crm resource move prm_xen_test-jeos PT5M force
Migration will take effect until: 2020-12-16 14:17:28 +01:00
INFO: Move constraint created for prm_xen_test-jeos

I could not find the trace information, but I found this in syslog:
Dec 16 14:17:13 h19 VirtualDomain(prm_xen_test-jeos)[20077]: INFO: test-jeos: Starting live migration to h18 (using: virsh --connect=xen:///system --quiet migrate --live  test-jeos xen://h18/system ).

I guess "xen://h18/system" should be "xen+tls://h18/system" here. In fact I verified it interactively. Due to the certificate subject the FQHN has to be used, too; otherwise I see "warning : virNetTLSContextCheckCertificate:1082 : Certificate check failed Certificate [session] owner does not match the hostname h18".

Regards,
Ulrich

> 
> 
>> Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Transition 840 action 
> 115 (prm_xen_test-jeos_migrate_to_0 on h16): expected 'ok' but got 'error'
>> Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result 
> (error: test-jeos: live migration to h19 failed: 1) was recorded for 
> migrate_to of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
>> Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result 
> (error: test-jeos: live migration to h19 failed: 1) was recorded for 
> migrate_to of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
>> ### (note the message above is duplicated!)
>> Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  error: Resource 
> prm_xen_test-jeos is active on 2 nodes (attempting recovery)
>> ### This is nonsense after a failed live migration!
> 
> Indeed, sounds like a valid improvement for pacemaker-schedulerd? Or, 
> articulate what should happen when migrate_to fails. I couldn't find that 
> defined in any doc yet.
> 
>> Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  notice:  * Recover    
> prm_xen_test-jeos                    (             h19 )
>> 
>> 
>> So the cluster is doing exactly the wrong thing: The VM is still active on 
> h16, while a "recovery" on h19 will start it there! So _after_ the recovery 
> the VM is duplicated.
>> 
>> Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Initiating stop 
> operation prm_xen_test-jeos_stop_0 locally on h19
>> Dec 16 09:28:47 h19 VirtualDomain(prm_xen_test-jeos)[20656]: INFO: Domain 
> test-jeos already stopped.
>> Dec 16 09:28:47 h19 pacemaker-execd[4425]:  notice: prm_xen_test-jeos stop 
> (call 372, PID 20620) exited with status 0 (execution time 283ms, queue time 
> 0ms)
>> Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Result of stop 
> operation for prm_xen_test-jeos on h19: ok
>> Dec 16 09:31:45 h19 pacemaker-controld[4428]:  notice: Initiating start 
> operation prm_xen_test-jeos_start_0 locally on h19
>> 
>> Dec 16 09:31:47 h19 pacemaker-execd[4425]:  notice: prm_xen_test-jeos start 
> (call 373, PID 21005) exited with status 0 (execution time 2715ms, queue time 
> 0ms)
>> Dec 16 09:31:47 h19 pacemaker-controld[4428]:  notice: Result of start 
> operation for prm_xen_test-jeos on h19: ok
>> Dec 16 09:33:46 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result 
> (error: test-jeos: live migration to h19 failed: 1) was recorded for 
> migrate_to of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
>> 
> 
> Yeah, schedulerd is trying so hard to report the migrate_to failure here!
> 
>> Amazingly, manual migration using virsh worked:
>> virsh migrate --live test-jeos xen+tls://h18...
>> 
> 
> What about s/h18/h19/?
> 
> Or, manually reproduce exactly what the RA code does:
> 
> `virsh ${VIRSH_OPTIONS} migrate --live $migrate_opts $DOMAIN_NAME $remoteuri 
> $migrateuri`
> 
> 
> Good luck!
> Roger
> 
> 
>> Regards,
>> Ulrich Windl
>> 
>> 
>>>>> Ulrich Windl wrote on 14.12.2020 at 15:21 in message <5FD774CF.8DE : 161 : 60728>:
>>> Hi!
>>>
>>> I think I found the reason why a VM is started on two nodes:
>>>
>>> Live migration had failed (e.g. away from h16), so the cluster uses stop and
>>> start instead (stop on h16, start on h19, for example).
>>> When rebooting h16, I see these messages (h19 is DC):
>>>
>>> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result
>>> (error: test-jeos: live migration to h16 failed: 1) was recorded for
>>> migrate_to of prm_xen_test-jeos on h19 at Dec 14 11:54:08 2020
>>> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  error: Resource
>>> prm_xen_test-jeos is active on 2 nodes (attempting recovery)
>>>
>>> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  notice:  * Restart
>>> prm_xen_test-jeos                    (             h16 )
>>>
>>> THIS IS WRONG: h16 was just booted, so no VM is running on h16 (unless there
>>> was some autostart from libvirt; "virsh list --autostart" does not list any).
>>>
>>> Dec 14 15:09:27 h16 VirtualDomain(prm_xen_test-jeos)[4850]: INFO: Domain
>>> test-jeos already stopped.
>>>
>>> Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  error: Calculated
>>> transition 669 (with errors), saving inputs in
>>> /var/lib/pacemaker/pengine/pe-error-4.bz2
>>>
>>> What's going on here?
>>>
>>> Regards,
>>> Ulrich
>>>
>>>>>> Ulrich Windl wrote on 14.12.2020 at 08:15 in message <5FD7110D.D09 : 161 : 60728>:
>>>> Hi!
>>>>
>>>> Another word of warning regarding VirtualDomain: While configuring a 3-node
>>>> cluster with SLES15 SP2 for Xen PVM (using libvirt and the VirtualDomain RA),
>>>> I had created a test VM using BtrFS.
>>>> At some point during testing the cluster ended up with the test VM running on
>>>> more than one node (for reasons still to be examined). Only after a "crm
>>>> resource refresh" (reprobe) did the cluster try to fix the problem.
>>>> Well, at some point the VM wouldn't start any more, because the BtrFS used
>>>> for everything (the SLES default) was corrupted in a way that seems
>>>> unrecoverable, independently of how many subvolumes and snapshots of those
>>>> may exist.
>>>>
>>>> Initially I would guess that the libvirt stack and VirtualDomain are less
>>>> reliable than the old Xen method and RA.
>>>>
>>>> Regards,
>>>> Ulrich
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>> 
>> 
>> 
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> ClusterLabs home: https://www.clusterlabs.org/ 
>> 




