<html><body><p>Ken and Co, <br><br>Thanks for the useful information. <br><br>I bumped the migrate-to timeout value from 1200ms to 360s , which should be more than enough time<br>to successfully migrate the resource (i.e. the KVM guest).  The migration was again interrupted with a timeout<br>at the 20000ms (20 second) mark, thus stopping / failing the resource, which shuts down the guest, then<br>reassigning the resource to the target node, then cold starting the resource and guest on the target.  <br><br>Does anybody know where this 20000ms timeout value gets set?  20sec is clearly inadequate for a virtual machine<br>that is running with any kind or I/O or memory intensive workload.  I need to have a way to change that timeout<br>value, if not by the migrate-to timeout value, then what? <br><br> Is this internally coded within the class=ocf provider=heartbeat type=VirtualDomain  resource agent? <br><br> For completeness, here's what I did..<br><br>## Show current timeout value. <br><br>[root@zs95kj ~]# date;pcs resource show zs95kjg110061_res<br>Tue Jan 17 13:36:05 EST 2017<br> Resource: zs95kjg110061_res (class=ocf provider=heartbeat type=VirtualDomain)<br>  Attributes: config=/guestxml/nfs1/zs95kjg110061.xml hypervisor=qemu:///system migration_transport=ssh<br>  Meta Attrs: allow-migrate=true remote-node=zs95kjg110061 remote-addr=10.20.110.61 target-role=Stopped<br>  Operations: start interval=0s timeout=480 (zs95kjg110061_res-start-interval-0s)<br>              stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)<br>              monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)<br>              migrate-from interval=0s timeout=1200 (zs95kjg110061_res-migrate-from-interval-0s)<br>              migrate-to interval=0s <b><font color="#0000FF">timeout=1200</font></b> (zs95kjg110061_res-migrate-to-interval-0s)<br><br><br>## Change timeout value from 1200(ms) to 360s ...<br><br><br>[root@zs95kj ~]# pcs resource update zs95kjg110061_res op migrate-to timeout="360s"<br><br>[root@zs95kj ~]# date;pcs resource show zs95kjg110061_res<br>Tue Jan 17 13:38:10 EST 2017<br> Resource: zs95kjg110061_res (class=ocf provider=heartbeat type=VirtualDomain)<br>  Attributes: config=/guestxml/nfs1/zs95kjg110061.xml hypervisor=qemu:///system migration_transport=ssh<br>  Meta Attrs: allow-migrate=true remote-node=zs95kjg110061 remote-addr=10.20.110.61 target-role=Stopped<br>  Operations: start interval=0s timeout=480 (zs95kjg110061_res-start-interval-0s)<br>              stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)<br>              monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)<br>              migrate-from interval=0s timeout=1200 (zs95kjg110061_res-migrate-from-interval-0s)<br>              migrate-to interval=0s<b><font color="#0000FF"> timeout=360s</font></b> (zs95kjg110061_res-migrate-to-interval-0s)<br><br><br>[root@zs95kj ~]# date;pcs resource enable zs95kjg110061_res<br>Tue Jan 17 13:40:55 EST 2017<br>[root@zs95kj ~]#<br><br><br>[root@zs95kj ~]# date;pcs resource show |grep zs95kjg110061_res<br>Tue Jan 17 13:41:16 EST 2017<br> zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1<br><br><br>## Started the I/O intensive 'blast' workload on the guest. <br><br><br>## Initiate the LGM via resource move CLI ..<br><br><br>[root@zs95kj ~]# date;pcs resource move zs95kjg110061_res zs90kppcs1<br>Tue Jan 17 13:54:50 EST 2017<br><br><br>## System log shows the following error thrown: <br><br>Jan 17 13:54:53 zs95kj crmd[27555]:  notice: Operation zs95kjg110061_stop_0: ok (node=zs95kjpcs1, call=44, rc=0, cib-update=450, confirmed=true)<br>Jan 17 13:54:53 zs95kj attrd[27553]:  notice: Removing all zs95kjg110061 attributes for zs95kjpcs1<br>Jan 17 13:54:54 zs95kj VirtualDomain(zs95kjg110061_res)[135045]: INFO: zs95kjg110061: <b>Starting live migration to zs90kppcs1 (using: virsh --connect=qemu:///system --quiet migrate --live  zs95kjg110061 qemu+ssh://zs90kppcs1/system ).</b><br><font color="#FF0000">Jan 17 13:55:14 zs95kj lrmd[27552]: warning: zs95kjg110061_res_migrate_to_0 process (PID 135045) timed out</font><br><font color="#FF0000">Jan 17 13:55:14 zs95kj lrmd[27552]: warning: zs95kjg110061_res_migrate_to_0:135045 - timed out after 20000ms</font><br>Jan 17 13:55:14 zs95kj crmd[27555]:   error: Operation zs95kjg110061_res_migrate_to_0: Timed Out (node=zs95kjpcs1, call=941, timeout=20000ms)<br>Jan 17 13:55:15 zs95kj VirtualDomain(zs95kjg110061_res)[136996]: INFO: Issuing graceful shutdown request for domain zs95kjg110061.<br>Jan 17 13:55:26 zs95kj systemd-machined: Machine qemu-58-zs95kjg110061 terminated.<br>Jan 17 13:55:26 zs95kj crmd[27555]:  notice: Operation zs95kjg110061_res_stop_0: ok (node=zs95kjpcs1, call=943, rc=0, cib-update=459, confirmed=true)<br><br><br>>>  This is consistent with my original symptom... the "internal" timeout value of 20000ms seems to override the migrate-to timeout value in the resource, <br>if in fact ... the migrate-to operation attribute is what would control the migration timeout in this case. <br><br><br>I happened to have full pacemaker debug enabled... so, here's generally what I see in the /var/log/cluster/pacemaker.log associated with the above migration and<br>resources associated with zs95kjg110061.  There doesn't seem to be much logged previous to the migration time out, which occurred at 13:55:14 : <br><br>Jan 17 13:54:54 [27555] zs95kj       crmd: (       lrm.c:2005  )   debug: do_lrm_rsc_op:        Stopped 0 recurring operations in preparation for zs95kjg110061_res_migrate_to_0<br>Jan 17 13:54:54 [27555] zs95kj       crmd: (       lrm.c:2009  )    info: do_lrm_rsc_op:        Performing key=337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af677dd op=zs95kjg110061_res_migrate_to_0<br>Jan 17 13:54:54 [27555] zs95kj       crmd: (lrmd_client.:817   )   trace: lrmd_send_command:    sending lrmd_rsc_exec op to lrmd<br>Jan 17 13:54:54 [27555] zs95kj       crmd: (lrmd_client.:488   )   trace: lrmd_create_op:       Sending call options: 00000004, 4<br>Jan 17 13:54:54 [27552] zs95kj       lrmd: (      lrmd.c:1644  )   trace: process_lrmd_message: Processing lrmd_rsc_exec operation from a2a90d95-a766-42ba-84a0-54574c10adee<br>Jan 17 13:54:54 [27552] zs95kj       lrmd: (      lrmd.c:347   )   trace: schedule_lrmd_cmd:    Scheduling migrate_to on zs95kjg110061_res<br>Jan 17 13:54:54 [27552] zs95kj       lrmd: (      lrmd.c:1684  )   debug: process_lrmd_message: Processed lrmd_rsc_exec operation from a2a90d95-a766-42ba-84a0-54574c10adee: rc=941, reply=1, notify=0, exit=-2147471352<br>Jan 17 13:54:54 [27552] zs95kj       lrmd: (      lrmd.c:131   )    info: log_execute:  executing - rsc:zs95kjg110061_res action:migrate_to call_id:941<br>Jan 17 13:54:54 [27552] zs95kj       lrmd: (      lrmd.c:1184  )   trace: lrmd_rsc_execute_service_lib: Creating action, resource:zs95kjg110061_res action:migrate_to class:ocf provider:heartbeat agent:VirtualDomain<br>Jan 17 13:54:54 [27555] zs95kj       crmd: (lrmd_client.:846   )   trace: lrmd_send_command:    lrmd_rsc_exec op reply received<br>Jan 17 13:54:54 [27555] zs95kj       crmd: (lrmd_client.:852   )   trace: lrmd_send_command:    Reply   <lrmd_reply lrmd_origin="send_reply" lrmd_rc="941" lrmd_callid="941"/><br>Jan 17 13:54:54 [27555] zs95kj       crmd: (lrmd_client.:300   )   trace: lrmd_dispatch_internal:       op lrmd_rsc_exec notify event received<br>Jan 17 13:54:54 [27555] zs95kj       crmd: (       lrm.c:2385  )    info: process_lrm_event:    Operation zs95kjg110061_res_monitor_30000: Cancelled (node=zs95kjpcs1, call=936, confirmed=true)<br>Jan 17 <b>13:54:54</b> [27555] zs95kj       crmd: (       lrm.c:196   )   debug: update_history_cache: Updating history for 'zs95kjg110061_res' with monitor op<br>VirtualDomain(zs95kjg110061_res)[135045]:       2017/01/17_13:54:54 DEBUG: Virtual domain zs95kjg110061 is currently running.<br>VirtualDomain(zs95kjg110061_res)[135045]:       2017/01/17_13:54:54 INFO:<b> zs95kjg110061: Starting live migration to zs90kppcs1 (using: virsh --connect=qemu:///system --quiet migrate --live  zs95kjg110061 qemu+ssh://zs90kppcs1/system ).</b><br><br>Jan 17 13:54:54 [27550] zs95kj        cib: (       xml.c:1728  )    info: cib_perform_op:       +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lr<br>m_resource[@id='zs95kjg110061_res']/lrm_rsc_op[@id='zs95kjg110061_res_last_0']:  @operation_key=zs95kjg110061_res_migrate_to_0, @operation=migrate_to, @crm-debu<br>g-origin=cib_action_update, @transition-key=337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af677dd, @transition-magic=-1:193;337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af67<br>7dd, @call-id=-1, @rc-code=193, @op-s<br><br><br>### 20 seconds after the live migration was initiated... we get the time out: <br><br>Jan 17 13:55:14 [27552] zs95kj       lrmd: (  mainloop.c:949   ) warning: child_timeout_callback:       zs95kjg110061_res_migrate_to_0 process (PID 135045) timed out<br>Jan 17 <b>13:55:14</b> [27552] zs95kj       lrmd: (services_lin:314   ) warning: operation_finished:   zs95kjg110061_res_migrate_to_0:135045 - <font color="#FF0000">timed out after 20000ms</font><br>Jan 17 13:55:14 [27552] zs95kj       lrmd: (services_lin:332   )   debug: operation_finished:   zs95kjg110061_res_migrate_to_0:135045:stderr [ -- empty -- ]<br>Jan 17 13:55:14 [27552] zs95kj       lrmd: (services_lin:336   )   debug: operation_finished:   zs95kjg110061_res_migrate_to_0:135045:stdout [ -- empty -- ]<br>Jan 17 13:55:14 [27552] zs95kj       lrmd: (      lrmd.c:592   )   trace: cmd_finalize: Resource operation rsc:zs95kjg110061_res action:migrate_to completed (0xa5f1edd0 0xa5f1edd0)<br>Jan 17 13:55:14 [27552] zs95kj       lrmd: (      lrmd.c:113   )    info: log_finished: finished - rsc:zs95kjg110061_res action:migrate_to call_id:941 pid:135045 exit-code:1 exec-time:20003ms queue-time:0ms<br>Jan 17 13:55:14 [27552] zs95kj       lrmd: (      lrmd.c:1292  )   trace: lrmd_rsc_execute:     Nothing further to do for zs95kjg110061_res<br>Jan 17 13:55:14 [27555] zs95kj       crmd: (     utils.c:1942  )   debug: create_operation_update:      do_update_resource: Updating resource zs95kjg110061_res after migrate_to op Timed Out (interval=0)<br>Jan 17 13:55:14 [27555] zs95kj       crmd: (       lrm.c:2397  )   error: process_lrm_event:    Operation zs95kjg110061_res_migrate_to_0: Timed Out (node=zs95kjpcs1, call=941, timeout=20000ms)<br>Jan 17 13:55:14 [27555] zs95kj       crmd: (       lrm.c:196   )   debug: update_history_cache: Updating history for 'zs95kjg110061_res' with migrate_to op<br><br><br>Any ideas? <br><br><br><br>Scott Greenlese ... <font face="Default Sans  Serif">IBM KVM on System z - Solution Test</font><font face="Default Sans  Serif">, </font> Poughkeepsie, N.Y.<br>  INTERNET:  swgreenl@us.ibm.com  <br>  <br><br><img width="16" height="16" src="cid:1__=8FBB0A38DFFD195C8f9e8a93df938690918c8FB@" border="0" alt="Inactive hide details for Ken Gaillot ---01/17/2017 11:41:53 AM---On 01/17/2017 10:19 AM, Scott Greenlese wrote: > Hi.."><font color="#424282">Ken Gaillot ---01/17/2017 11:41:53 AM---On 01/17/2017 10:19 AM, Scott Greenlese wrote: > Hi..</font><br><br><font size="2" color="#5F5F5F">From:        </font><font size="2">Ken Gaillot <kgaillot@redhat.com></font><br><font size="2" color="#5F5F5F">To:        </font><font size="2">users@clusterlabs.org</font><br><font size="2" color="#5F5F5F">Date:        </font><font size="2">01/17/2017 11:41 AM</font><br><font size="2" color="#5F5F5F">Subject:        </font><font size="2">Re: [ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources</font><br><hr width="100%" size="2" align="left" noshade style="color:#8091A5; "><br><br><br><tt>On 01/17/2017 10:19 AM, Scott Greenlese wrote:<br>> Hi..<br>> <br>> I've been testing live guest migration (LGM) with VirtualDomain<br>> resources, which are guests running on Linux KVM / System Z<br>> managed by pacemaker.<br>> <br>> I'm looking for documentation that explains how to configure my<br>> VirtualDomain resources such that they will not timeout<br>> prematurely when there is a heavy I/O workload running on the guest.<br>> <br>> If I perform the LGM with an unmanaged guest (resource disabled), it<br>> takes anywhere from 2 - 5 minutes to complete the LGM.<br>> Example:<br>> <br>> # Migrate guest, specify a timeout value of 600s<br>> <br>> [root@zs95kj VD]# date;virsh --keepalive-interval 10 migrate --live<br>> --persistent --undefinesource*--timeout 600* --verbose zs95kjg110061<br>> qemu+ssh://zs90kppcs1/system<br>> Mon Jan 16 16:35:32 EST 2017<br>> <br>> Migration: [100 %]<br>> <br>> [root@zs95kj VD]# date<br>> Mon Jan 16 16:40:01 EST 2017<br>> [root@zs95kj VD]#<br>> <br>> Start: 16:35:32<br>> End: 16:40:01<br>> Total: *4 min 29 sec*<br>> <br>> <br>> In comparison, when the guest is managed by pacemaker, and enabled for<br>> LGM ... I get this:<br>> <br>> [root@zs95kj VD]# date;pcs resource show zs95kjg110061_res<br>> Mon Jan 16 15:13:33 EST 2017<br>> Resource: zs95kjg110061_res (class=ocf provider=heartbeat<br>> type=VirtualDomain)<br>> Attributes: config=/guestxml/nfs1/zs95kjg110061.xml<br>> hypervisor=qemu:///system migration_transport=ssh<br>> Meta Attrs: allow-migrate=true remote-node=zs95kjg110061<br>> remote-addr=10.20.110.61<br>> Operations: start interval=0s timeout=480<br>> (zs95kjg110061_res-start-interval-0s)<br>> stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)<br>> monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)<br>> migrate-from interval=0s timeout=1200<br>> (zs95kjg110061_res-migrate-from-interval-0s)<br>> *migrate-to* interval=0s *timeout=1200*<br>> (zs95kjg110061_res-migrate-to-interval-0s)<br>> <br>> NOTE: I didn't specify any migrate-to value for timeout, so it defaulted<br>> to 1200. Is this seconds? If so, that's 20 minutes,<br>> ample time to complete a 5 minute migration.<br><br>Not sure where the default of 1200 comes from, but I believe the default<br>is milliseconds if no unit is specified. Normally you'd specify<br>something like "timeout=1200s".<br><br>> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res<br>> Mon Jan 16 14:27:01 EST 2017<br>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1<br>> [root@zs95kj VD]#<br>> <br>> <br>> [root@zs95kj VD]# date;*pcs resource move zs95kjg110061_res zs95kjpcs1*<br>> Mon Jan 16 14:45:39 EST 2017<br>> You have new mail in /var/spool/mail/root<br>> <br>> <br>> Jan 16 14:45:37 zs90kp VirtualDomain(zs95kjg110061_res)[21050]: INFO:<br>> zs95kjg110061: *Starting live migration to zs95kjpcs1 (using: virsh<br>> --connect=qemu:///system --quiet migrate --live zs95kjg110061<br>> qemu+ssh://zs95kjpcs1/system ).*<br>> Jan 16 14:45:57 zs90kp lrmd[12798]: warning:<br>> zs95kjg110061_res_migrate_to_0 process (PID 21050) timed out<br>> Jan 16 14:45:57 zs90kp lrmd[12798]: warning:<br>> zs95kjg110061_res_migrate_to_0:21050 - timed out after 20000ms<br>> Jan 16 14:45:57 zs90kp crmd[12801]: error: Operation<br>> zs95kjg110061_res_migrate_to_0: Timed Out (node=zs90kppcs1, call=1978,<br>> timeout=20000ms)<br>> Jan 16 14:45:58 zs90kp journal: operation failed: migration job:<br>> unexpectedly failed<br>> [root@zs90KP VD]#<br>> <br>> So, the migration timed out after 20000ms. Assuming ms is milliseconds,<br>> that's only 20 seconds. So, it seems that LGM timeout has<br>> nothing to do with *migrate-to* on the resource definition.<br><br>Yes, ms is milliseconds. Pacemaker internally represents all times in<br>milliseconds, even though in most actual usage, it has 1-second granularity.<br><br>If your specified timeout is 1200ms, I'm not sure why it's using<br>20000ms. There may be a minimum enforced somewhere.<br><br>> Also, what is the expected behavior when the migration times out? I<br>> watched the VirtualDomain resource state during the migration process...<br>> <br>> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res<br>> Mon Jan 16 14:45:57 EST 2017<br>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1<br>> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res<br>> Mon Jan 16 14:46:02 EST 2017<br>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1<br>> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res<br>> Mon Jan 16 14:46:06 EST 2017<br>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1<br>> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res<br>> Mon Jan 16 14:46:08 EST 2017<br>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1<br>> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res<br>> Mon Jan 16 14:46:10 EST 2017<br>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1<br>> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res<br>> Mon Jan 16 14:46:12 EST 2017<br>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Stopped<br>> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res<br>> Mon Jan 16 14:46:14 EST 2017<br>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1<br>> [root@zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res<br>> Mon Jan 16 14:46:17 EST 2017<br>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1<br>> [root@zs95kj VD]#<br>> <br>> <br>> So, it seems as if the guest migration actually did succeed, at least<br>> the guest is running<br>> on the target node (KVM host). However... I checked the<br><br>Failure handling is configurable, but by default, if a live migration<br>fails, the cluster will do a full restart (= full stop then start). So<br>basically, it turns from a live migration to a cold migration.<br><br>> "blast" IO workload (writes to external, virtual storage accessible to<br>> both all cluster<br>> hosts)<br>> <br>> I can experiment with different *migrate-to* timeout value settings, but<br>> would really<br>> prefer to have a good understanding of timeout configuration and<br>> recovery behavior first.<br>> <br>> Thanks!<br>> <br>> <br>> Scott Greenlese ... IBM KVM on System z - Solution Test, Poughkeepsie, N.Y.<br>> INTERNET: swgreenl@us.ibm.com<br><br>_______________________________________________<br>Users mailing list: Users@clusterlabs.org<br></tt><tt><a href="http://lists.clusterlabs.org/mailman/listinfo/users">http://lists.clusterlabs.org/mailman/listinfo/users</a></tt><tt><br><br>Project Home: </tt><tt><a href="http://www.clusterlabs.org">http://www.clusterlabs.org</a></tt><tt><br>Getting started: </tt><tt><a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a></tt><tt><br>Bugs: </tt><tt><a href="http://bugs.clusterlabs.org">http://bugs.clusterlabs.org</a></tt><tt><br><br></tt><br><br><BR>

</body></html>