[ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

Ken Gaillot kgaillot at redhat.com
Wed Jan 18 16:32:52 CET 2017


On 01/17/2017 04:45 PM, Scott Greenlese wrote:
> Ken and Co,
> 
> Thanks for the useful information.
> 
> I bumped the migrate-to timeout value from 1200 (ms) to 360s, which should
> be more than enough time to successfully migrate the resource (i.e. the
> KVM guest). The migration was again interrupted by a timeout at the
> 20000ms (20 second) mark, which fails and stops the resource (shutting
> down the guest), then reassigns the resource to the target node and
> cold-starts the resource and guest there.
> 
> Does anybody know where this 20000ms timeout value gets set? 20 seconds is
> clearly inadequate for a virtual machine running any kind of I/O- or
> memory-intensive workload. I need a way to change that timeout value; if
> not via the migrate-to timeout, then how?
> 
> Is this internally coded within the class=ocf provider=heartbeat
> type=VirtualDomain resource agent?

Aha, I just realized what the issue is: the operation name is
migrate_to, not migrate-to.

Because the configured operation is named migrate-to, it never matches
the migrate_to action that actually runs, so Pacemaker falls back to its
default operation timeout of 20 seconds, which is the 20000ms you keep
hitting.

For technical reasons, Pacemaker can't validate operation names: at the
time the configuration is edited, it does not necessarily have access to
the agent metadata.
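
Something like the following should get your timeout applied (untested
here; the update syntax is the same one you already used, just with the
correct operation name, and "pcs resource op remove" is the usual way to
drop an operation, but double-check the syntax on your pcs version):

# Drop the misnamed operation, then configure the real one
pcs resource op remove zs95kjg110061_res migrate-to
pcs resource update zs95kjg110061_res op migrate_to timeout=360s

# Verify that the operation list now shows migrate_to
pcs resource show zs95kjg110061_res

Note that your migrate-from operation has the same problem: the action
Pacemaker actually runs is migrate_from, so that timeout isn't being
applied either.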

> For completeness, here's what I did..
> 
> ## Show current timeout value.
> 
> [root at zs95kj ~]# date;pcs resource show zs95kjg110061_res
> Tue Jan 17 13:36:05 EST 2017
> Resource: zs95kjg110061_res (class=ocf provider=heartbeat
> type=VirtualDomain)
> Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
> hypervisor=qemu:///system migration_transport=ssh
> Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
> remote-addr=10.20.110.61 target-role=Stopped
> Operations: start interval=0s timeout=480
> (zs95kjg110061_res-start-interval-0s)
> stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
> monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
> migrate-from interval=0s timeout=1200
> (zs95kjg110061_res-migrate-from-interval-0s)
> migrate-to interval=0s *timeout=1200*
> (zs95kjg110061_res-migrate-to-interval-0s)
> 
> 
> ## Change timeout value from 1200(ms) to 360s ...
> 
> 
> [root at zs95kj ~]# pcs resource update zs95kjg110061_res op migrate-to
> timeout="360s"
> 
> [root at zs95kj ~]# date;pcs resource show zs95kjg110061_res
> Tue Jan 17 13:38:10 EST 2017
> Resource: zs95kjg110061_res (class=ocf provider=heartbeat
> type=VirtualDomain)
> Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
> hypervisor=qemu:///system migration_transport=ssh
> Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
> remote-addr=10.20.110.61 target-role=Stopped
> Operations: start interval=0s timeout=480
> (zs95kjg110061_res-start-interval-0s)
> stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
> monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
> migrate-from interval=0s timeout=1200
> (zs95kjg110061_res-migrate-from-interval-0s)
> migrate-to interval=0s *timeout=360s*
> (zs95kjg110061_res-migrate-to-interval-0s)
> 
> 
> [root at zs95kj ~]# date;pcs resource enable zs95kjg110061_res
> Tue Jan 17 13:40:55 EST 2017
> [root at zs95kj ~]#
> 
> 
> [root at zs95kj ~]# date;pcs resource show |grep zs95kjg110061_res
> Tue Jan 17 13:41:16 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
> 
> 
> ## Started the I/O intensive 'blast' workload on the guest.
> 
> 
> ## Initiate the LGM via resource move CLI ..
> 
> 
> [root at zs95kj ~]# date;pcs resource move zs95kjg110061_res zs90kppcs1
> Tue Jan 17 13:54:50 EST 2017
> 
> 
> ## System log shows the following error thrown:
> 
> Jan 17 13:54:53 zs95kj crmd[27555]: notice: Operation
> zs95kjg110061_stop_0: ok (node=zs95kjpcs1, call=44, rc=0,
> cib-update=450, confirmed=true)
> Jan 17 13:54:53 zs95kj attrd[27553]: notice: Removing all zs95kjg110061
> attributes for zs95kjpcs1
> Jan 17 13:54:54 zs95kj VirtualDomain(zs95kjg110061_res)[135045]: INFO:
> zs95kjg110061: *Starting live migration to zs90kppcs1 (using: virsh
> --connect=qemu:///system --quiet migrate --live zs95kjg110061
> qemu+ssh://zs90kppcs1/system ).*
> Jan 17 13:55:14 zs95kj lrmd[27552]: warning:
> zs95kjg110061_res_migrate_to_0 process (PID 135045) timed out
> Jan 17 13:55:14 zs95kj lrmd[27552]: warning:
> zs95kjg110061_res_migrate_to_0:135045 - timed out after 20000ms
> Jan 17 13:55:14 zs95kj crmd[27555]: error: Operation
> zs95kjg110061_res_migrate_to_0: Timed Out (node=zs95kjpcs1, call=941,
> timeout=20000ms)
> Jan 17 13:55:15 zs95kj VirtualDomain(zs95kjg110061_res)[136996]: INFO:
> Issuing graceful shutdown request for domain zs95kjg110061.
> Jan 17 13:55:26 zs95kj systemd-machined: Machine qemu-58-zs95kjg110061
> terminated.
> Jan 17 13:55:26 zs95kj crmd[27555]: notice: Operation
> zs95kjg110061_res_stop_0: ok (node=zs95kjpcs1, call=943, rc=0,
> cib-update=459, confirmed=true)
> 
> 
>>> This is consistent with my original symptom... the "internal" timeout
> value of 20000ms seems to override the migrate-to timeout set on the
> resource, if in fact the migrate-to operation attribute is what controls
> the migration timeout in this case.
> 
> 
> I happened to have full pacemaker debug enabled... so here's generally
> what I see in /var/log/cluster/pacemaker.log for the above migration and
> the resources associated with zs95kjg110061. There doesn't seem to be much
> logged prior to the migration timeout, which occurred at 13:55:14:
> 
> Jan 17 13:54:54 [27555] zs95kj crmd: ( lrm.c:2005 ) debug:
> do_lrm_rsc_op: Stopped 0 recurring operations in preparation for
> zs95kjg110061_res_migrate_to_0
> Jan 17 13:54:54 [27555] zs95kj crmd: ( lrm.c:2009 ) info: do_lrm_rsc_op:
> Performing key=337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af677dd
> op=zs95kjg110061_res_migrate_to_0
> Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:817 ) trace:
> lrmd_send_command: sending lrmd_rsc_exec op to lrmd
> Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:488 ) trace:
> lrmd_create_op: Sending call options: 00000004, 4
> Jan 17 13:54:54 [27552] zs95kj lrmd: ( lrmd.c:1644 ) trace:
> process_lrmd_message: Processing lrmd_rsc_exec operation from
> a2a90d95-a766-42ba-84a0-54574c10adee
> Jan 17 13:54:54 [27552] zs95kj lrmd: ( lrmd.c:347 ) trace:
> schedule_lrmd_cmd: Scheduling migrate_to on zs95kjg110061_res
> Jan 17 13:54:54 [27552] zs95kj lrmd: ( lrmd.c:1684 ) debug:
> process_lrmd_message: Processed lrmd_rsc_exec operation from
> a2a90d95-a766-42ba-84a0-54574c10adee: rc=941, reply=1, notify=0,
> exit=-2147471352
> Jan 17 13:54:54 [27552] zs95kj lrmd: ( lrmd.c:131 ) info: log_execute:
> executing - rsc:zs95kjg110061_res action:migrate_to call_id:941
> Jan 17 13:54:54 [27552] zs95kj lrmd: ( lrmd.c:1184 ) trace:
> lrmd_rsc_execute_service_lib: Creating action,
> resource:zs95kjg110061_res action:migrate_to class:ocf
> provider:heartbeat agent:VirtualDomain
> Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:846 ) trace:
> lrmd_send_command: lrmd_rsc_exec op reply received
> Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:852 ) trace:
> lrmd_send_command: Reply <lrmd_reply lrmd_origin="send_reply"
> lrmd_rc="941" lrmd_callid="941"/>
> Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:300 ) trace:
> lrmd_dispatch_internal: op lrmd_rsc_exec notify event received
> Jan 17 13:54:54 [27555] zs95kj crmd: ( lrm.c:2385 ) info:
> process_lrm_event: Operation zs95kjg110061_res_monitor_30000: Cancelled
> (node=zs95kjpcs1, call=936, confirmed=true)
> Jan 17 *13:54:54* [27555] zs95kj crmd: ( lrm.c:196 ) debug:
> update_history_cache: Updating history for 'zs95kjg110061_res' with
> monitor op
> VirtualDomain(zs95kjg110061_res)[135045]: 2017/01/17_13:54:54 DEBUG:
> Virtual domain zs95kjg110061 is currently running.
> VirtualDomain(zs95kjg110061_res)[135045]: 2017/01/17_13:54:54
> INFO: *zs95kjg110061: Starting live migration to zs90kppcs1 (using: virsh
> --connect=qemu:///system --quiet migrate --live zs95kjg110061
> qemu+ssh://zs90kppcs1/system ).*
> 
> Jan 17 13:54:54 [27550] zs95kj cib: ( xml.c:1728 ) info: cib_perform_op:
> + /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='zs95kjg110061_res']/lrm_rsc_op[@id='zs95kjg110061_res_last_0']:
> @operation_key=zs95kjg110061_res_migrate_to_0, @operation=migrate_to,
> @crm-debug-origin=cib_action_update,
> @transition-key=337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af677dd,
> @transition-magic=-1:193;337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af677dd,
> @call-id=-1, @rc-code=193, @op-s
> 
> 
> ## 20 seconds after the live migration was initiated... we get the timeout:
> 
> Jan 17 13:55:14 [27552] zs95kj lrmd: ( mainloop.c:949 ) warning:
> child_timeout_callback: zs95kjg110061_res_migrate_to_0 process (PID
> 135045) timed out
> Jan 17 *13:55:14* [27552] zs95kj lrmd: (services_lin:314 ) warning:
> operation_finished: zs95kjg110061_res_migrate_to_0:135045 - timed out
> after 20000ms
> Jan 17 13:55:14 [27552] zs95kj lrmd: (services_lin:332 ) debug:
> operation_finished: zs95kjg110061_res_migrate_to_0:135045:stderr [ --
> empty -- ]
> Jan 17 13:55:14 [27552] zs95kj lrmd: (services_lin:336 ) debug:
> operation_finished: zs95kjg110061_res_migrate_to_0:135045:stdout [ --
> empty -- ]
> Jan 17 13:55:14 [27552] zs95kj lrmd: ( lrmd.c:592 ) trace: cmd_finalize:
> Resource operation rsc:zs95kjg110061_res action:migrate_to completed
> (0xa5f1edd0 0xa5f1edd0)
> Jan 17 13:55:14 [27552] zs95kj lrmd: ( lrmd.c:113 ) info: log_finished:
> finished - rsc:zs95kjg110061_res action:migrate_to call_id:941
> pid:135045 exit-code:1 exec-time:20003ms queue-time:0ms
> Jan 17 13:55:14 [27552] zs95kj lrmd: ( lrmd.c:1292 ) trace:
> lrmd_rsc_execute: Nothing further to do for zs95kjg110061_res
> Jan 17 13:55:14 [27555] zs95kj crmd: ( utils.c:1942 ) debug:
> create_operation_update: do_update_resource: Updating resource
> zs95kjg110061_res after migrate_to op Timed Out (interval=0)
> Jan 17 13:55:14 [27555] zs95kj crmd: ( lrm.c:2397 ) error:
> process_lrm_event: Operation zs95kjg110061_res_migrate_to_0: Timed Out
> (node=zs95kjpcs1, call=941, timeout=20000ms)
> Jan 17 13:55:14 [27555] zs95kj crmd: ( lrm.c:196 ) debug:
> update_history_cache: Updating history for 'zs95kjg110061_res' with
> migrate_to op
> 
> 
> Any ideas?
> 
> 
> 
> Scott Greenlese ... IBM KVM on System z - Solution Test, Poughkeepsie, N.Y.
> INTERNET: swgreenl at us.ibm.com
> 
> 
> 
> From: Ken Gaillot <kgaillot at redhat.com>
> To: users at clusterlabs.org
> Date: 01/17/2017 11:41 AM
> Subject: Re: [ClusterLabs] Live Guest Migration timeouts for
> VirtualDomain resources
> 
> ------------------------------------------------------------------------
> 
> 
> 
> On 01/17/2017 10:19 AM, Scott Greenlese wrote:
>> Hi..
>>
>> I've been testing live guest migration (LGM) with VirtualDomain
>> resources, which are guests running on Linux KVM / System Z
>> managed by pacemaker.
>>
>> I'm looking for documentation that explains how to configure my
>> VirtualDomain resources such that they will not time out prematurely
>> when a heavy I/O workload is running on the guest.
>>
>> If I perform the LGM with an unmanaged guest (resource disabled), it
>> takes anywhere from 2 - 5 minutes to complete the LGM.
>> Example:
>>
>> # Migrate guest, specify a timeout value of 600s
>>
>> [root at zs95kj VD]# date;virsh --keepalive-interval 10 migrate --live
>> --persistent --undefinesource *--timeout 600* --verbose zs95kjg110061
>> qemu+ssh://zs90kppcs1/system
>> Mon Jan 16 16:35:32 EST 2017
>>
>> Migration: [100 %]
>>
>> [root at zs95kj VD]# date
>> Mon Jan 16 16:40:01 EST 2017
>> [root at zs95kj VD]#
>>
>> Start: 16:35:32
>> End: 16:40:01
>> Total: *4 min 29 sec*
>>
>>
>> In comparison, when the guest is managed by pacemaker, and enabled for
>> LGM ... I get this:
>>
>> [root at zs95kj VD]# date;pcs resource show zs95kjg110061_res
>> Mon Jan 16 15:13:33 EST 2017
>> Resource: zs95kjg110061_res (class=ocf provider=heartbeat
>> type=VirtualDomain)
>> Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
>> hypervisor=qemu:///system migration_transport=ssh
>> Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
>> remote-addr=10.20.110.61
>> Operations: start interval=0s timeout=480
>> (zs95kjg110061_res-start-interval-0s)
>> stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
>> monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
>> migrate-from interval=0s timeout=1200
>> (zs95kjg110061_res-migrate-from-interval-0s)
>> *migrate-to* interval=0s *timeout=1200*
>> (zs95kjg110061_res-migrate-to-interval-0s)
>>
>> NOTE: I didn't specify any migrate-to value for timeout, so it defaulted
>> to 1200. Is this seconds? If so, that's 20 minutes,
>> ample time to complete a 5 minute migration.
> 
> Not sure where the default of 1200 comes from, but I believe the default
> is milliseconds if no unit is specified. Normally you'd specify
> something like "timeout=1200s".
> 
>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>> Mon Jan 16 14:27:01 EST 2017
>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
>> [root at zs95kj VD]#
>>
>>
>> [root at zs95kj VD]# date;*pcs resource move zs95kjg110061_res zs95kjpcs1*
>> Mon Jan 16 14:45:39 EST 2017
>> You have new mail in /var/spool/mail/root
>>
>>
>> Jan 16 14:45:37 zs90kp VirtualDomain(zs95kjg110061_res)[21050]: INFO:
>> zs95kjg110061: *Starting live migration to zs95kjpcs1 (using: virsh
>> --connect=qemu:///system --quiet migrate --live zs95kjg110061
>> qemu+ssh://zs95kjpcs1/system ).*
>> Jan 16 14:45:57 zs90kp lrmd[12798]: warning:
>> zs95kjg110061_res_migrate_to_0 process (PID 21050) timed out
>> Jan 16 14:45:57 zs90kp lrmd[12798]: warning:
>> zs95kjg110061_res_migrate_to_0:21050 - timed out after 20000ms
>> Jan 16 14:45:57 zs90kp crmd[12801]: error: Operation
>> zs95kjg110061_res_migrate_to_0: Timed Out (node=zs90kppcs1, call=1978,
>> timeout=20000ms)
>> Jan 16 14:45:58 zs90kp journal: operation failed: migration job:
>> unexpectedly failed
>> [root at zs90KP VD]#
>>
>> So, the migration timed out after 20000ms. Assuming ms is milliseconds,
>> that's only 20 seconds. So, it seems that LGM timeout has
>> nothing to do with *migrate-to* on the resource definition.
> 
> Yes, ms is milliseconds. Pacemaker internally represents all times in
> milliseconds, even though in most actual usage, it has 1-second granularity.
> 
> If your specified timeout is 1200ms, I'm not sure why it's using
> 20000ms. There may be a minimum enforced somewhere.
> 
>> Also, what is the expected behavior when the migration times out? I
>> watched the VirtualDomain resource state during the migration process...
>>
>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>> Mon Jan 16 14:45:57 EST 2017
>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>> Mon Jan 16 14:46:02 EST 2017
>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>> Mon Jan 16 14:46:06 EST 2017
>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>> Mon Jan 16 14:46:08 EST 2017
>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>> Mon Jan 16 14:46:10 EST 2017
>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>> Mon Jan 16 14:46:12 EST 2017
>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Stopped
>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>> Mon Jan 16 14:46:14 EST 2017
>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>> Mon Jan 16 14:46:17 EST 2017
>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
>> [root at zs95kj VD]#
>>
>>
>> So, it seems as if the guest migration actually did succeed, at least
>> the guest is running
>> on the target node (KVM host). However... I checked the
> 
> Failure handling is configurable, but by default, if a live migration
> fails, the cluster will do a full restart (= full stop then start). So
> basically, it turns from a live migration to a cold migration.
> 
>> "blast" IO workload (writes to external, virtual storage accessible to
>> both all cluster
>> hosts)
>>
>> I can experiment with different *migrate-to* timeout value settings, but
>> would really
>> prefer to have a good understanding of timeout configuration and
>> recovery behavior first.
>>
>> Thanks!
>>
>>
>> Scott Greenlese ... IBM KVM on System z - Solution Test, Poughkeepsie,
> N.Y.
>> INTERNET: swgreenl at us.ibm.com
> 


