[ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

Ken Gaillot kgaillot at redhat.com
Tue Jan 17 11:40:54 EST 2017


On 01/17/2017 10:19 AM, Scott Greenlese wrote:
> Hi..
> 
> I've been testing live guest migration (LGM) with VirtualDomain
> resources, which are guests running on Linux KVM / System Z
> managed by pacemaker.
> 
> I'm looking for documentation that explains how to configure my
> VirtualDomain resources such that they will not timeout
> prematurely when there is a heavy I/O workload running on the guest.
> 
> If I perform the LGM with an unmanaged guest (resource disabled), it
> takes anywhere from 2 - 5 minutes to complete the LGM.
> Example:
> 
> # Migrate guest, specify a timeout value of 600s
> 
> [root at zs95kj VD]# date;virsh --keepalive-interval 10 migrate --live
> --persistent --undefinesource *--timeout 600* --verbose zs95kjg110061
> qemu+ssh://zs90kppcs1/system
> Mon Jan 16 16:35:32 EST 2017
> 
> Migration: [100 %]
> 
> [root at zs95kj VD]# date
> Mon Jan 16 16:40:01 EST 2017
> [root at zs95kj VD]#
> 
> Start: 16:35:32
> End: 16:40:01
> Total: *4 min 29 sec*
> 
> 
> In comparison, when the guest is managed by pacemaker, and enabled for
> LGM ... I get this:
> 
> [root at zs95kj VD]# date;pcs resource show zs95kjg110061_res
> Mon Jan 16 15:13:33 EST 2017
> Resource: zs95kjg110061_res (class=ocf provider=heartbeat
> type=VirtualDomain)
> Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
> hypervisor=qemu:///system migration_transport=ssh
> Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
> remote-addr=10.20.110.61
> Operations: start interval=0s timeout=480
> (zs95kjg110061_res-start-interval-0s)
> stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
> monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
> migrate-from interval=0s timeout=1200
> (zs95kjg110061_res-migrate-from-interval-0s)
> *migrate-to* interval=0s *timeout=1200*
> (zs95kjg110061_res-migrate-to-interval-0s)
> 
> NOTE: I didn't specify any migrate-to value for timeout, so it defaulted
> to 1200. Is this seconds? If so, that's 20 minutes,
> ample time to complete a 5 minute migration.

Not sure where the default of 1200 comes from, but I believe a bare
number is treated as milliseconds if no unit is specified. Normally
you'd specify something like "timeout=1200s".
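
For example, something along these lines should give the migration
operations an explicit 20-minute timeout. This is just a sketch using
the resource name from your output; verify the result with "pcs
resource show zs95kjg110061_res" afterward:

  # pcs resource update zs95kjg110061_res op migrate_to interval=0s timeout=1200s
  # pcs resource update zs95kjg110061_res op migrate_from interval=0s timeout=1200s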

> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:27:01 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
> [root at zs95kj VD]#
> 
> 
> [root at zs95kj VD]# date;*pcs resource move zs95kjg110061_res zs95kjpcs1*
> Mon Jan 16 14:45:39 EST 2017
> You have new mail in /var/spool/mail/root
> 
> 
> Jan 16 14:45:37 zs90kp VirtualDomain(zs95kjg110061_res)[21050]: INFO:
> zs95kjg110061: *Starting live migration to zs95kjpcs1 (using: virsh
> --connect=qemu:///system --quiet migrate --live zs95kjg110061
> qemu+ssh://zs95kjpcs1/system ).*
> Jan 16 14:45:57 zs90kp lrmd[12798]: warning:
> zs95kjg110061_res_migrate_to_0 process (PID 21050) timed out
> Jan 16 14:45:57 zs90kp lrmd[12798]: warning:
> zs95kjg110061_res_migrate_to_0:21050 - timed out after 20000ms
> Jan 16 14:45:57 zs90kp crmd[12801]: error: Operation
> zs95kjg110061_res_migrate_to_0: Timed Out (node=zs90kppcs1, call=1978,
> timeout=20000ms)
> Jan 16 14:45:58 zs90kp journal: operation failed: migration job:
> unexpectedly failed
> [root at zs90KP VD]#
> 
> So, the migration timed out after 20000ms. Assuming ms is milliseconds,
> that's only 20 seconds. So, it seems that LGM timeout has
> nothing to do with *migrate-to* on the resource definition.

Yes, ms is milliseconds. Pacemaker internally represents all times in
milliseconds, even though in most actual usage, it has 1-second granularity.

If your specified timeout is 1200ms, I'm not sure why it's using
20000ms. There may be a minimum enforced somewhere.
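
One thing that might be worth checking: 20 seconds happens to be the
stock value of the default-action-timeout cluster property, which is
what an operation falls back to when it has no usable timeout of its
own. A rough way to compare the two (the grep pattern and xpath here
are just a sketch):

  # pcs property list --all | grep -i timeout
  # cibadmin --query --xpath "//primitive[@id='zs95kjg110061_res']//op"

If the migrate-to operation in the CIB isn't being matched to the
migrate_to action the agent actually runs, the cluster default would
apply instead of your 1200.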

> Also, what is the expected behavior when the migration times out? I
> watched the VirtualDomain resource state during the migration process...
> 
> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:45:57 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:02 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:06 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:08 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:10 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:12 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Stopped
> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:14 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:17 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
> [root at zs95kj VD]#
> 
> 
> So, it seems as if the guest migration actually did succeed, at least
> the guest is running
> on the target node (KVM host). However... I checked the

Failure handling is configurable, but by default, if a live migration
fails, the cluster will do a full restart (= full stop then start). So
basically, it turns from a live migration to a cold migration.
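
If you just want to clear the failure afterward so the next move
attempts a live migration again, something like this (again a sketch,
using your resource name) should do it:

  # pcs resource failcount show zs95kjg110061_res
  # pcs resource cleanup zs95kjg110061_res

The recovery behavior itself is controlled by the usual knobs, namely
the operation's on-fail setting and the resource's migration-threshold
meta attribute, with a full restart being the default.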

> "blast" IO workload (writes to external, virtual storage accessible to
> both all cluster
> hosts)
> 
> I can experiment with different *migrate-to* timeout value settings, but
> would really
> prefer to have a good understanding of timeout configuration and
> recovery behavior first.
> 
> Thanks!
> 
> 
> Scott Greenlese ... IBM KVM on System z - Solution Test, Poughkeepsie, N.Y.
> INTERNET: swgreenl at us.ibm.com



