Hi,

I've been testing live guest migration (LGM) with VirtualDomain resources, which are guests running on Linux KVM on System z,
managed by Pacemaker.

I'm looking for documentation that explains how to configure my VirtualDomain resources so that they do not time out
prematurely when there is a heavy I/O workload running on the guest.

If I perform the LGM with an unmanaged guest (resource disabled), it takes anywhere from 2 to 5 minutes to complete.
Example:

# Migrate guest, specify a timeout value of 600s

[root@zs95kj VD]# date;virsh --keepalive-interval 10 migrate --live --persistent --undefinesource --timeout 600 --verbose zs95kjg110061 qemu+ssh://zs90kppcs1/system
Mon Jan 16 16:35:32 EST 2017

Migration: [100 %]

[root@zs95kj VD]# date
Mon Jan 16 16:40:01 EST 2017
[root@zs95kj VD]#

Start: 16:35:32
End:   16:40:01
Total: 4 min 29 sec


In comparison, when the guest is managed by Pacemaker and enabled for LGM, I get this:

[root@zs95kj VD]# date;pcs resource show zs95kjg110061_res
Mon Jan 16 15:13:33 EST 2017
 Resource: zs95kjg110061_res (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/guestxml/nfs1/zs95kjg110061.xml hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true remote-node=zs95kjg110061 remote-addr=10.20.110.61
  Operations: start interval=0s timeout=480 (zs95kjg110061_res-start-interval-0s)
              stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
              monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
              migrate-from interval=0s timeout=1200 (zs95kjg110061_res-migrate-from-interval-0s)
              migrate-to interval=0s timeout=1200 (zs95kjg110061_res-migrate-to-interval-0s)

NOTE: I didn't specify any timeout value for migrate-to, so it defaulted to 1200. Is that seconds? If so, that's 20 minutes,
which is ample time to complete a 5-minute migration.


[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:27:01 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
[root@zs95kj VD]#


[root@zs95kj VD]# date;pcs resource move zs95kjg110061_res zs95kjpcs1
Mon Jan 16 14:45:39 EST 2017
You have new mail in /var/spool/mail/root


Jan 16 14:45:37 zs90kp VirtualDomain(zs95kjg110061_res)[21050]: INFO: zs95kjg110061: Starting live migration to zs95kjpcs1 (using: virsh --connect=qemu:///system --quiet migrate --live zs95kjg110061 qemu+ssh://zs95kjpcs1/system ).
Jan 16 14:45:57 zs90kp lrmd[12798]: warning: zs95kjg110061_res_migrate_to_0 process (PID 21050) timed out
Jan 16 14:45:57 zs90kp lrmd[12798]: warning: zs95kjg110061_res_migrate_to_0:21050 - timed out after 20000ms
Jan 16 14:45:57 zs90kp crmd[12801]: error: Operation zs95kjg110061_res_migrate_to_0: Timed Out (node=zs90kppcs1, call=1978, timeout=20000ms)
Jan 16 14:45:58 zs90kp journal: operation failed: migration job: unexpectedly failed
[root@zs90KP VD]#

So the migration timed out after 20000ms. Assuming ms is milliseconds, that's only 20 seconds, which suggests the LGM
timeout has nothing to do with the migrate-to timeout in the resource definition.
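In case it helps frame the question, this is the kind of change I was planning to try next. I'm assuming that the operation
name Pacemaker actually acts on is migrate_to with an underscore (that's how the lrmd log names it,
zs95kjg110061_res_migrate_to_0, even though pcs displays migrate-to), that "pcs resource update ... op migrate_to ..." is
the right way to override its timeout, and that an explicit "s" suffix removes any doubt about the units. None of that is
confirmed, which is partly why I'm asking:

# Sketch only, not yet tested: pin the migration operation timeouts explicitly
[root@zs95kj VD]# pcs resource update zs95kjg110061_res op migrate_to interval=0s timeout=1200s op migrate_from interval=0s timeout=1200s

# Then check what actually landed in the CIB, rather than trusting the pcs summary
[root@zs95kj VD]# cibadmin --query --scope resources | grep -i migrate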
Also, what is the expected behavior when the migration times out? I watched the VirtualDomain resource state during the
migration process:

[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:45:57 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:02 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:06 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:08 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:10 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:12 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Stopped
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:14 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
[root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
Mon Jan 16 14:46:17 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
[root@zs95kj VD]#


So it seems the guest migration actually did succeed, at least to the extent that the guest is now running on the target
node (KVM host). However... I checked the "blast" I/O workload (writes to external virtual storage accessible to all
cluster hosts).

I can experiment with different migrate-to timeout values, but I would really prefer to have a good understanding of the
timeout configuration and recovery behavior first.
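When I do start experimenting, I was planning to capture the resource state and the libvirt migration job statistics side
by side during the next move, something along these lines (assuming virsh domjobinfo will report on the guest from the
source host while the migration job is in flight):

# Poll resource state and migration job progress every 2 seconds (Ctrl-C to stop)
[root@zs95kj VD]# while sleep 2; do date; pcs resource show | grep zs95kjg110061_res; virsh domjobinfo zs95kjg110061; done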
Thanks!


Scott Greenlese ... IBM KVM on System z - Solution Test, Poughkeepsie, N.Y.
  INTERNET: swgreenl@us.ibm.com