[ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

Scott Greenlese swgreenl at us.ibm.com
Tue Jan 17 11:19:06 EST 2017


Hi..

I've been testing live guest migration (LGM) with VirtualDomain resources,
i.e. guests running on Linux KVM on System z and managed by Pacemaker.

I'm looking for documentation that explains how to configure my
VirtualDomain resources so that they do not time out prematurely when a
heavy I/O workload is running on the guest.
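
To make the question concrete, the knob I think I'm after is the timeout on
the resource's migrate-to operation, along the lines of the sketch below
(I'm not certain whether pcs expects the op name as migrate_to or migrate-to
here, nor what units a bare number is interpreted in):

# Sketch only: raise the live migration timeout (600s is an arbitrary guess)
pcs resource update zs95kjg110061_res op migrate_to timeout=600s

What I can't find is documentation spelling out which timeout actually
governs the live migration, or what the cluster does when it expires.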

If I perform the LGM with an unmanaged guest (resource disabled), it takes
anywhere from 2 to 5 minutes to complete.
Example:

# Migrate guest, specify a timeout value of 600s

[root@zs95kj VD]# date;virsh --keepalive-interval 10 migrate --live --persistent --undefinesource --timeout 600 --verbose zs95kjg110061 qemu+ssh://zs90kppcs1/system
Mon Jan 16 16:35:32 EST 2017

Migration: [100 %]

[root@zs95kj VD]# date
Mon Jan 16 16:40:01 EST 2017
[root@zs95kj VD]#

Start:  16:35:32
End:    16:40:01
Total:   4 min  29 sec


In comparison, when the guest is managed by Pacemaker and enabled for LGM,
I get this:

[root@zs95kj VD]# date;pcs resource show zs95kjg110061_res
Mon Jan 16 15:13:33 EST 2017
 Resource: zs95kjg110061_res (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/guestxml/nfs1/zs95kjg110061.xml hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true remote-node=zs95kjg110061 remote-addr=10.20.110.61
  Operations: start interval=0s timeout=480 (zs95kjg110061_res-start-interval-0s)
              stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
              monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
              migrate-from interval=0s timeout=1200 (zs95kjg110061_res-migrate-from-interval-0s)
              migrate-to interval=0s timeout=1200 (zs95kjg110061_res-migrate-to-interval-0s)

NOTE: I didn't specify a timeout for the migrate-to operation, so it
defaulted to 1200. Is that seconds? If so, that's 20 minutes, which is
ample time to complete a 5-minute migration.
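
To sanity-check the units, I plan to look at the raw entry for that
operation in the CIB; a sketch of the check (the grep pattern is just my
guess based on the op IDs shown above):

pcs cluster cib | grep 'zs95kjg110061_res-migrate'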


[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:27:01 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
[root@zs95kj VD]#


[root@zs95kj VD]# date;pcs resource move zs95kjg110061_res zs95kjpcs1
Mon Jan 16 14:45:39 EST 2017
You have new mail in /var/spool/mail/root


Jan 16 14:45:37 zs90kp VirtualDomain(zs95kjg110061_res)[21050]: INFO: zs95kjg110061: Starting live migration to zs95kjpcs1 (using: virsh --connect=qemu:///system --quiet migrate --live  zs95kjg110061 qemu+ssh://zs95kjpcs1/system ).
Jan 16 14:45:57 zs90kp lrmd[12798]: warning: zs95kjg110061_res_migrate_to_0 process (PID 21050) timed out
Jan 16 14:45:57 zs90kp lrmd[12798]: warning: zs95kjg110061_res_migrate_to_0:21050 - timed out after 20000ms
Jan 16 14:45:57 zs90kp crmd[12801]:   error: Operation zs95kjg110061_res_migrate_to_0: Timed Out (node=zs90kppcs1, call=1978, timeout=20000ms)
Jan 16 14:45:58 zs90kp journal: operation failed: migration job: unexpectedly failed
[root@zs90KP VD]#

So the migration timed out after 20000ms. Assuming ms is milliseconds,
that's only 20 seconds, so it seems the LGM timeout has nothing to do with
the migrate-to timeout on the resource definition.
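
For what it's worth, 20000ms matches what I believe is Pacemaker's default
action timeout of 20 seconds, which makes me suspect my migrate-to timeout
isn't being applied at all. A sketch of how I'd confirm the cluster-wide
default (assuming it is exposed as a property on this release):

pcs property list --all | grep -i timeout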


Also, what is the expected behavior when the migration times out?   I
watched the VirtualDomain resource state during the migration process...

[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:45:57 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:02 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:06 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:08 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:10 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:12 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Stopped
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:14 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:17 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
[root@zs95kj VD]#


So, it seems as if the guest migration actually did succeed; at least, the
guest is running on the target node (KVM host). However, I checked the
"blast" I/O workload (writes to external, virtual storage accessible to all
cluster hosts) ...
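
A quick way to double-check which host actually ends up with the domain
would be something like the following (just a sketch; the connection URIs
are guessed from the hostnames above):

virsh -c qemu+ssh://zs95kjpcs1/system list --all | grep zs95kjg110061
virsh -c qemu+ssh://zs90kppcs1/system list --all | grep zs95kjg110061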

I can experiment with different migrate-to timeout values, but would really
prefer to have a good understanding of the timeout configuration and
recovery behavior first.
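
Concretely, the experiment I have in mind is to bump the timeout as sketched
earlier, clear the failure from the attempt above, and retry the move while
watching the resource state. A sketch:

pcs resource cleanup zs95kjg110061_res
pcs resource move zs95kjg110061_res zs90kppcs1
watch -n2 'pcs resource show | grep zs95kjg110061_res'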

Thanks!


Scott Greenlese ... IBM KVM on System z - Solution Test,  Poughkeepsie,
N.Y.
  INTERNET:  swgreenl at us.ibm.com
