[ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

Tomas Jelinek tojeline at redhat.com
Wed Jan 18 17:35:22 CET 2017


On 18.1.2017 at 16:32, Ken Gaillot wrote:
> On 01/17/2017 04:45 PM, Scott Greenlese wrote:
>> Ken and Co,
>>
>> Thanks for the useful information.
>>
>> I bumped the migrate-to timeout value from 1200ms to 360s, which should
>> be more than enough time to successfully migrate the resource (i.e. the
>> KVM guest). The migration was again interrupted with a timeout at the
>> 20000ms (20 second) mark, which stops/fails the resource (shutting down
>> the guest), reassigns the resource to the target node, and then cold
>> starts the resource and guest on the target.
>>
>> Does anybody know where this 20000ms timeout value gets set? 20 sec is
>> clearly inadequate for a virtual machine running any kind of I/O- or
>> memory-intensive workload. I need a way to change that timeout value;
>> if not via the migrate-to timeout value, then what?
>>
>> Is this internally coded within the class=ocf provider=heartbeat
>> type=VirtualDomain resource agent?
>
> Aha, I just realized what the issue is: the operation name is
> migrate_to, not migrate-to.
>
> For technical reasons, pacemaker can't validate operation names (at the
> time that the configuration is edited, it does not necessarily have
> access to the agent metadata).
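
The fix on your side is to configure the operations under their correct
names (migrate_to / migrate_from). Something along these lines should
work; this is a sketch based on the resource shown in this thread, so
please double-check the pcs syntax against your version:

  pcs resource update zs95kjg110061_res op migrate_to timeout=360s
  pcs resource update zs95kjg110061_res op migrate_from timeout=360s
  pcs resource op remove zs95kjg110061_res migrate-to
  pcs resource op remove zs95kjg110061_res migrate-from

The two "op remove" commands drop the misnamed migrate-to/migrate-from
entries so that only the correctly named operations remain in the CIB.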

We'll add operation name validation to pcs to cover this issue. For the
"pcs resource create" command it will be done in the near future. For
the rest of the commands, however, I cannot promise any date; this is
not on our priority list right now.
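
Until then, you can double-check which operation names an agent
actually advertises by dumping its metadata, e.g. (assuming
crm_resource is available on a cluster node):

  crm_resource --show-metadata ocf:heartbeat:VirtualDomain

The <actions> section of the output lists migrate_to and migrate_from
along with the agent's suggested timeouts.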

Regards,
Tomas

>
>> For completeness, here's what I did:
>>
>> ## Show current timeout value.
>>
>> [root at zs95kj ~]# date;pcs resource show zs95kjg110061_res
>> Tue Jan 17 13:36:05 EST 2017
>> Resource: zs95kjg110061_res (class=ocf provider=heartbeat
>> type=VirtualDomain)
>> Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
>> hypervisor=qemu:///system migration_transport=ssh
>> Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
>> remote-addr=10.20.110.61 target-role=Stopped
>> Operations: start interval=0s timeout=480
>> (zs95kjg110061_res-start-interval-0s)
>> stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
>> monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
>> migrate-from interval=0s timeout=1200
>> (zs95kjg110061_res-migrate-from-interval-0s)
>> migrate-to interval=0s *timeout=1200*
>> (zs95kjg110061_res-migrate-to-interval-0s)
>>
>>
>> ## Change timeout value from 1200(ms) to 360s ...
>>
>>
>> [root at zs95kj ~]# pcs resource update zs95kjg110061_res op migrate-to
>> timeout="360s"
>>
>> [root at zs95kj ~]# date;pcs resource show zs95kjg110061_res
>> Tue Jan 17 13:38:10 EST 2017
>> Resource: zs95kjg110061_res (class=ocf provider=heartbeat
>> type=VirtualDomain)
>> Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
>> hypervisor=qemu:///system migration_transport=ssh
>> Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
>> remote-addr=10.20.110.61 target-role=Stopped
>> Operations: start interval=0s timeout=480
>> (zs95kjg110061_res-start-interval-0s)
>> stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
>> monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
>> migrate-from interval=0s timeout=1200
>> (zs95kjg110061_res-migrate-from-interval-0s)
>> migrate-to interval=0s *timeout=360s*
>> (zs95kjg110061_res-migrate-to-interval-0s)
>>
>>
>> [root at zs95kj ~]# date;pcs resource enable zs95kjg110061_res
>> Tue Jan 17 13:40:55 EST 2017
>> [root at zs95kj ~]#
>>
>>
>> [root at zs95kj ~]# date;pcs resource show |grep zs95kjg110061_res
>> Tue Jan 17 13:41:16 EST 2017
>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
>>
>>
>> ## Started the I/O intensive 'blast' workload on the guest.
>>
>>
>> ## Initiate the LGM via resource move CLI ..
>>
>>
>> [root at zs95kj ~]# date;pcs resource move zs95kjg110061_res zs90kppcs1
>> Tue Jan 17 13:54:50 EST 2017
>>
>>
>> ## System log shows the following error thrown:
>>
>> Jan 17 13:54:53 zs95kj crmd[27555]: notice: Operation
>> zs95kjg110061_stop_0: ok (node=zs95kjpcs1, call=44, rc=0,
>> cib-update=450, confirmed=true)
>> Jan 17 13:54:53 zs95kj attrd[27553]: notice: Removing all zs95kjg110061
>> attributes for zs95kjpcs1
>> Jan 17 13:54:54 zs95kj VirtualDomain(zs95kjg110061_res)[135045]: INFO:
>> zs95kjg110061: *Starting live migration to zs90kppcs1 (using: virsh
>> --connect=qemu:///system --quiet migrate --live zs95kjg110061
>> qemu+ssh://zs90kppcs1/system ).*
>> Jan 17 13:55:14 zs95kj lrmd[27552]: warning:
>> zs95kjg110061_res_migrate_to_0 process (PID 135045) timed out
>> Jan 17 13:55:14 zs95kj lrmd[27552]: warning:
>> zs95kjg110061_res_migrate_to_0:135045 - timed out after 20000ms
>> Jan 17 13:55:14 zs95kj crmd[27555]: error: Operation
>> zs95kjg110061_res_migrate_to_0: Timed Out (node=zs95kjpcs1, call=941,
>> timeout=20000ms)
>> Jan 17 13:55:15 zs95kj VirtualDomain(zs95kjg110061_res)[136996]: INFO:
>> Issuing graceful shutdown request for domain zs95kjg110061.
>> Jan 17 13:55:26 zs95kj systemd-machined: Machine qemu-58-zs95kjg110061
>> terminated.
>> Jan 17 13:55:26 zs95kj crmd[27555]: notice: Operation
>> zs95kjg110061_res_stop_0: ok (node=zs95kjpcs1, call=943, rc=0,
>> cib-update=459, confirmed=true)
>>
>>
>> This is consistent with my original symptom... the "internal" timeout
>> value of 20000ms seems to override the migrate-to timeout value on the
>> resource, if in fact the migrate-to operation timeout is what controls
>> the migration timeout in this case.
>>
>>
>> I happened to have full pacemaker debug enabled... so here is roughly
>> what I see in /var/log/cluster/pacemaker.log for the above migration
>> and the resources associated with zs95kjg110061. There doesn't seem to
>> be much logged before the migration timed out at 13:55:14:
>>
>> Jan 17 13:54:54 [27555] zs95kj crmd: ( lrm.c:2005 ) debug:
>> do_lrm_rsc_op: Stopped 0 recurring operations in preparation for
>> zs95kjg110061_res_migrate_to_0
>> Jan 17 13:54:54 [27555] zs95kj crmd: ( lrm.c:2009 ) info: do_lrm_rsc_op:
>> Performing key=337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af677dd
>> op=zs95kjg110061_res_migrate_to_0
>> Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:817 ) trace:
>> lrmd_send_command: sending lrmd_rsc_exec op to lrmd
>> Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:488 ) trace:
>> lrmd_create_op: Sending call options: 00000004, 4
>> Jan 17 13:54:54 [27552] zs95kj lrmd: ( lrmd.c:1644 ) trace:
>> process_lrmd_message: Processing lrmd_rsc_exec operation from
>> a2a90d95-a766-42ba-84a0-54574c10adee
>> Jan 17 13:54:54 [27552] zs95kj lrmd: ( lrmd.c:347 ) trace:
>> schedule_lrmd_cmd: Scheduling migrate_to on zs95kjg110061_res
>> Jan 17 13:54:54 [27552] zs95kj lrmd: ( lrmd.c:1684 ) debug:
>> process_lrmd_message: Processed lrmd_rsc_exec operation from
>> a2a90d95-a766-42ba-84a0-54574c10adee: rc=941, reply=1, notify=0,
>> exit=-2147471352
>> Jan 17 13:54:54 [27552] zs95kj lrmd: ( lrmd.c:131 ) info: log_execute:
>> executing - rsc:zs95kjg110061_res action:migrate_to call_id:941
>> Jan 17 13:54:54 [27552] zs95kj lrmd: ( lrmd.c:1184 ) trace:
>> lrmd_rsc_execute_service_lib: Creating action,
>> resource:zs95kjg110061_res action:migrate_to class:ocf
>> provider:heartbeat agent:VirtualDomain
>> Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:846 ) trace:
>> lrmd_send_command: lrmd_rsc_exec op reply received
>> Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:852 ) trace:
>> lrmd_send_command: Reply <lrmd_reply lrmd_origin="send_reply"
>> lrmd_rc="941" lrmd_callid="941"/>
>> Jan 17 13:54:54 [27555] zs95kj crmd: (lrmd_client.:300 ) trace:
>> lrmd_dispatch_internal: op lrmd_rsc_exec notify event received
>> Jan 17 13:54:54 [27555] zs95kj crmd: ( lrm.c:2385 ) info:
>> process_lrm_event: Operation zs95kjg110061_res_monitor_30000: Cancelled
>> (node=zs95kjpcs1, call=936, confirmed=true)
>> Jan 17 *13:54:54* [27555] zs95kj crmd: ( lrm.c:196 ) debug:
>> update_history_cache: Updating history for 'zs95kjg110061_res' with
>> monitor op
>> VirtualDomain(zs95kjg110061_res)[135045]: 2017/01/17_13:54:54 DEBUG:
>> Virtual domain zs95kjg110061 is currently running.
>> VirtualDomain(zs95kjg110061_res)[135045]: 2017/01/17_13:54:54
>> INFO: *zs95kjg110061: Starting live migration to zs90kppcs1 (using: virsh
>> --connect=qemu:///system --quiet migrate --live zs95kjg110061
>> qemu+ssh://zs90kppcs1/system ).*
>>
>> Jan 17 13:54:54 [27550] zs95kj cib: ( xml.c:1728 ) info: cib_perform_op:
>> + /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lr
>> m_resource[@id='zs95kjg110061_res']/lrm_rsc_op[@id='zs95kjg110061_res_last_0']:
>> @operation_key=zs95kjg110061_res_migrate_to_0, @operation=migrate_to,
>> @crm-debu
>> g-origin=cib_action_update,
>> @transition-key=337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af677dd,
>> @transition-magic=-1:193;337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af67
>> 7dd, @call-id=-1, @rc-code=193, @op-s
>>
>>
>> ### 20 seconds after the live migration was initiated... we get the
>> timeout:
>>
>> Jan 17 13:55:14 [27552] zs95kj lrmd: ( mainloop.c:949 ) warning:
>> child_timeout_callback: zs95kjg110061_res_migrate_to_0 process (PID
>> 135045) timed out
>> Jan 17 *13:55:14* [27552] zs95kj lrmd: (services_lin:314 ) warning:
>> operation_finished: zs95kjg110061_res_migrate_to_0:135045 - timed out
>> after 20000ms
>> Jan 17 13:55:14 [27552] zs95kj lrmd: (services_lin:332 ) debug:
>> operation_finished: zs95kjg110061_res_migrate_to_0:135045:stderr [ --
>> empty -- ]
>> Jan 17 13:55:14 [27552] zs95kj lrmd: (services_lin:336 ) debug:
>> operation_finished: zs95kjg110061_res_migrate_to_0:135045:stdout [ --
>> empty -- ]
>> Jan 17 13:55:14 [27552] zs95kj lrmd: ( lrmd.c:592 ) trace: cmd_finalize:
>> Resource operation rsc:zs95kjg110061_res action:migrate_to completed
>> (0xa5f1edd0 0xa5f1edd0)
>> Jan 17 13:55:14 [27552] zs95kj lrmd: ( lrmd.c:113 ) info: log_finished:
>> finished - rsc:zs95kjg110061_res action:migrate_to call_id:941
>> pid:135045 exit-code:1 exec-time:20003ms queue-time:0ms
>> Jan 17 13:55:14 [27552] zs95kj lrmd: ( lrmd.c:1292 ) trace:
>> lrmd_rsc_execute: Nothing further to do for zs95kjg110061_res
>> Jan 17 13:55:14 [27555] zs95kj crmd: ( utils.c:1942 ) debug:
>> create_operation_update: do_update_resource: Updating resource
>> zs95kjg110061_res after migrate_to op Timed Out (interval=0)
>> Jan 17 13:55:14 [27555] zs95kj crmd: ( lrm.c:2397 ) error:
>> process_lrm_event: Operation zs95kjg110061_res_migrate_to_0: Timed Out
>> (node=zs95kjpcs1, call=941, timeout=20000ms)
>> Jan 17 13:55:14 [27555] zs95kj crmd: ( lrm.c:196 ) debug:
>> update_history_cache: Updating history for 'zs95kjg110061_res' with
>> migrate_to op
>>
>>
>> Any ideas?
>>
>>
>>
>> Scott Greenlese ... IBM KVM on System z - Solution Test, Poughkeepsie, N.Y.
>> INTERNET: swgreenl at us.ibm.com
>>
>>
>>
>> From: Ken Gaillot <kgaillot at redhat.com>
>> To: users at clusterlabs.org
>> Date: 01/17/2017 11:41 AM
>> Subject: Re: [ClusterLabs] Live Guest Migration timeouts for
>> VirtualDomain resources
>>
>> ------------------------------------------------------------------------
>>
>>
>>
>> On 01/17/2017 10:19 AM, Scott Greenlese wrote:
>>> Hi..
>>>
>>> I've been testing live guest migration (LGM) with VirtualDomain
>>> resources, which are guests running on Linux KVM / System Z
>>> managed by pacemaker.
>>>
>>> I'm looking for documentation that explains how to configure my
>>> VirtualDomain resources such that they will not timeout
>>> prematurely when there is a heavy I/O workload running on the guest.
>>>
>>> If I perform the LGM with an unmanaged guest (resource disabled), it
>>> takes anywhere from 2 - 5 minutes to complete the LGM.
>>> Example:
>>>
>>> # Migrate guest, specify a timeout value of 600s
>>>
>>> [root at zs95kj VD]# date;virsh --keepalive-interval 10 migrate --live
>>> --persistent --undefinesource *--timeout 600* --verbose zs95kjg110061
>>> qemu+ssh://zs90kppcs1/system
>>> Mon Jan 16 16:35:32 EST 2017
>>>
>>> Migration: [100 %]
>>>
>>> [root at zs95kj VD]# date
>>> Mon Jan 16 16:40:01 EST 2017
>>> [root at zs95kj VD]#
>>>
>>> Start: 16:35:32
>>> End: 16:40:01
>>> Total: *4 min 29 sec*
>>>
>>>
>>> In comparison, when the guest is managed by pacemaker, and enabled for
>>> LGM ... I get this:
>>>
>>> [root at zs95kj VD]# date;pcs resource show zs95kjg110061_res
>>> Mon Jan 16 15:13:33 EST 2017
>>> Resource: zs95kjg110061_res (class=ocf provider=heartbeat
>>> type=VirtualDomain)
>>> Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
>>> hypervisor=qemu:///system migration_transport=ssh
>>> Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
>>> remote-addr=10.20.110.61
>>> Operations: start interval=0s timeout=480
>>> (zs95kjg110061_res-start-interval-0s)
>>> stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
>>> monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
>>> migrate-from interval=0s timeout=1200
>>> (zs95kjg110061_res-migrate-from-interval-0s)
>>> *migrate-to* interval=0s *timeout=1200*
>>> (zs95kjg110061_res-migrate-to-interval-0s)
>>>
>>> NOTE: I didn't specify any migrate-to value for timeout, so it defaulted
>>> to 1200. Is this seconds? If so, that's 20 minutes,
>>> ample time to complete a 5 minute migration.
>>
>> Not sure where the default of 1200 comes from, but I believe the default
>> is milliseconds if no unit is specified. Normally you'd specify
>> something like "timeout=1200s".
>>
>>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>>> Mon Jan 16 14:27:01 EST 2017
>>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
>>> [root at zs95kj VD]#
>>>
>>>
>>> [root at zs95kj VD]# date;*pcs resource move zs95kjg110061_res zs95kjpcs1*
>>> Mon Jan 16 14:45:39 EST 2017
>>> You have new mail in /var/spool/mail/root
>>>
>>>
>>> Jan 16 14:45:37 zs90kp VirtualDomain(zs95kjg110061_res)[21050]: INFO:
>>> zs95kjg110061: *Starting live migration to zs95kjpcs1 (using: virsh
>>> --connect=qemu:///system --quiet migrate --live zs95kjg110061
>>> qemu+ssh://zs95kjpcs1/system ).*
>>> Jan 16 14:45:57 zs90kp lrmd[12798]: warning:
>>> zs95kjg110061_res_migrate_to_0 process (PID 21050) timed out
>>> Jan 16 14:45:57 zs90kp lrmd[12798]: warning:
>>> zs95kjg110061_res_migrate_to_0:21050 - timed out after 20000ms
>>> Jan 16 14:45:57 zs90kp crmd[12801]: error: Operation
>>> zs95kjg110061_res_migrate_to_0: Timed Out (node=zs90kppcs1, call=1978,
>>> timeout=20000ms)
>>> Jan 16 14:45:58 zs90kp journal: operation failed: migration job:
>>> unexpectedly failed
>>> [root at zs90KP VD]#
>>>
>>> So, the migration timed out after 20000ms. Assuming ms is milliseconds,
>>> that's only 20 seconds. So, it seems that LGM timeout has
>>> nothing to do with *migrate-to* on the resource definition.
>>
>> Yes, ms is milliseconds. Pacemaker internally represents all times in
>> milliseconds, even though in most actual usage, it has 1-second granularity.
>>
>> If your specified timeout is 1200ms, I'm not sure why it's using
>> 20000ms. There may be a minimum enforced somewhere.
>>
>>> Also, what is the expected behavior when the migration times out? I
>>> watched the VirtualDomain resource state during the migration process...
>>>
>>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>>> Mon Jan 16 14:45:57 EST 2017
>>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
>>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>>> Mon Jan 16 14:46:02 EST 2017
>>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
>>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>>> Mon Jan 16 14:46:06 EST 2017
>>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
>>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>>> Mon Jan 16 14:46:08 EST 2017
>>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
>>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>>> Mon Jan 16 14:46:10 EST 2017
>>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
>>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>>> Mon Jan 16 14:46:12 EST 2017
>>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Stopped
>>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>>> Mon Jan 16 14:46:14 EST 2017
>>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
>>> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
>>> Mon Jan 16 14:46:17 EST 2017
>>> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
>>> [root at zs95kj VD]#
>>>
>>>
>>> So, it seems as if the guest migration actually did succeed, at least
>>> the guest is running
>>> on the target node (KVM host). However... I checked the
>>
>> Failure handling is configurable, but by default, if a live migration
>> fails, the cluster will do a full restart (= full stop then start). So
>> basically, it turns from a live migration to a cold migration.
>>
>>> "blast" IO workload (writes to external, virtual storage accessible to
>>> both all cluster
>>> hosts)
>>>
>>> I can experiment with different *migrate-to* timeout value settings, but
>>> would really
>>> prefer to have a good understanding of timeout configuration and
>>> recovery behavior first.
>>>
>>> Thanks!
>>>
>>>
>>> Scott Greenlese ... IBM KVM on System z - Solution Test, Poughkeepsie,
>> N.Y.
>>> INTERNET: swgreenl at us.ibm.com
>>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>


