[ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

Scott Greenlese swgreenl at us.ibm.com
Tue Jan 17 17:45:35 EST 2017


Ken and Co,

Thanks for the useful information.

I bumped the migrate-to timeout value from 1200ms to 360s, which should
be more than enough time to successfully migrate the resource (i.e. the
KVM guest).  The migration was again interrupted by a timeout at the
20000ms (20 second) mark, which stops/fails the resource, shuts down the
guest, reassigns the resource to the target node, and then cold starts
the resource and guest on the target.

Does anybody know where this 20000ms timeout value gets set?  20 seconds
is clearly inadequate for a virtual machine running any kind of I/O or
memory intensive workload.  I need a way to change that timeout value:
if not via the migrate-to operation timeout, then how?

 Is this internally coded within the class=ocf provider=heartbeat
type=VirtualDomain  resource agent?
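
In case it matters, I also plan to check whether a cluster-wide operation
default or the agent's own advertised defaults could be supplying that 20
second value.  Roughly along these lines (just a sketch; the exact options
and output depend on the pcs / pacemaker levels):

pcs resource op defaults                   # any cluster-wide op defaults set?
pcs property list --all | grep -i timeout  # a default-action-timeout property?
crm_resource --show-metadata ocf:heartbeat:VirtualDomain \
    | grep -A1 migrate                     # timeouts the agent itself advertises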

 For completeness, here's what I did:

## Show current timeout value.

[root at zs95kj ~]# date;pcs resource show zs95kjg110061_res
Tue Jan 17 13:36:05 EST 2017
 Resource: zs95kjg110061_res (class=ocf provider=heartbeat
type=VirtualDomain)
  Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
remote-addr=10.20.110.61 target-role=Stopped
  Operations: start interval=0s timeout=480
(zs95kjg110061_res-start-interval-0s)
              stop interval=0s timeout=120
(zs95kjg110061_res-stop-interval-0s)
              monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
              migrate-from interval=0s timeout=1200
(zs95kjg110061_res-migrate-from-interval-0s)
              migrate-to interval=0s timeout=1200
(zs95kjg110061_res-migrate-to-interval-0s)


## Change timeout value from 1200(ms) to 360s ...


[root at zs95kj ~]# pcs resource update zs95kjg110061_res op migrate-to
timeout="360s"

[root at zs95kj ~]# date;pcs resource show zs95kjg110061_res
Tue Jan 17 13:38:10 EST 2017
 Resource: zs95kjg110061_res (class=ocf provider=heartbeat
type=VirtualDomain)
  Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
remote-addr=10.20.110.61 target-role=Stopped
  Operations: start interval=0s timeout=480
(zs95kjg110061_res-start-interval-0s)
              stop interval=0s timeout=120
(zs95kjg110061_res-stop-interval-0s)
              monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
              migrate-from interval=0s timeout=1200
(zs95kjg110061_res-migrate-from-interval-0s)
              migrate-to interval=0s timeout=360s
(zs95kjg110061_res-migrate-to-interval-0s)
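
Note that migrate-from still carries the unitless 1200 value.  If
unitless timeouts really are taken as milliseconds, I suppose that one
should get an explicit unit too, e.g. (not run here):

pcs resource update zs95kjg110061_res op migrate-from timeout="360s"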


[root at zs95kj ~]# date;pcs resource enable zs95kjg110061_res
Tue Jan 17 13:40:55 EST 2017
[root at zs95kj ~]#


[root at zs95kj ~]# date;pcs resource show |grep zs95kjg110061_res
Tue Jan 17 13:41:16 EST 2017
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1


## Started the I/O intensive 'blast' workload on the guest.


## Initiate the LGM via resource move CLI ..


[root at zs95kj ~]# date;pcs resource move zs95kjg110061_res zs90kppcs1
Tue Jan 17 13:54:50 EST 2017


## The system log shows the following errors:

Jan 17 13:54:53 zs95kj crmd[27555]:  notice: Operation
zs95kjg110061_stop_0: ok (node=zs95kjpcs1, call=44, rc=0, cib-update=450,
confirmed=true)
Jan 17 13:54:53 zs95kj attrd[27553]:  notice: Removing all zs95kjg110061
attributes for zs95kjpcs1
Jan 17 13:54:54 zs95kj VirtualDomain(zs95kjg110061_res)[135045]: INFO:
zs95kjg110061: Starting live migration to zs90kppcs1 (using: virsh
--connect=qemu:///system --quiet migrate --live  zs95kjg110061 qemu
+ssh://zs90kppcs1/system ).
Jan 17 13:55:14 zs95kj lrmd[27552]: warning: zs95kjg110061_res_migrate_to_0
process (PID 135045) timed out
Jan 17 13:55:14 zs95kj lrmd[27552]: warning:
zs95kjg110061_res_migrate_to_0:135045 - timed out after 20000ms
Jan 17 13:55:14 zs95kj crmd[27555]:   error: Operation
zs95kjg110061_res_migrate_to_0: Timed Out (node=zs95kjpcs1, call=941,
timeout=20000ms)
Jan 17 13:55:15 zs95kj VirtualDomain(zs95kjg110061_res)[136996]: INFO:
Issuing graceful shutdown request for domain zs95kjg110061.
Jan 17 13:55:26 zs95kj systemd-machined: Machine qemu-58-zs95kjg110061
terminated.
Jan 17 13:55:26 zs95kj crmd[27555]:  notice: Operation
zs95kjg110061_res_stop_0: ok (node=zs95kjpcs1, call=943, rc=0,
cib-update=459, confirmed=true)


>>  This is consistent with my original symptom... the "internal"
timeout value of 20000ms seems to override the migrate-to timeout value
on the resource, assuming the migrate-to operation timeout is in fact
what controls the migration timeout in this case.
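
One more thing I can do is dump the live CIB and confirm what timeout
value actually got recorded for the migrate_to operation, in case what
pcs displays differs from what the cluster is using.  Something like
(a sketch; the grep pattern may need adjusting):

pcs cluster cib | grep -B2 -A2 migrate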


I happened to have full pacemaker debug enabled, so here is generally
what I see in /var/log/cluster/pacemaker.log for the above migration
and the resources associated with zs95kjg110061.  There doesn't seem to
be much logged prior to the migration timeout, which occurred at
13:55:14:

Jan 17 13:54:54 [27555] zs95kj       crmd: (       lrm.c:2005  )   debug:
do_lrm_rsc_op:        Stopped 0 recurring operations in preparation for
zs95kjg110061_res_migrate_to_0
Jan 17 13:54:54 [27555] zs95kj       crmd: (       lrm.c:2009  )    info:
do_lrm_rsc_op:        Performing
key=337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af677dd
op=zs95kjg110061_res_migrate_to_0
Jan 17 13:54:54 [27555] zs95kj       crmd: (lrmd_client.:817   )   trace:
lrmd_send_command:    sending lrmd_rsc_exec op to lrmd
Jan 17 13:54:54 [27555] zs95kj       crmd: (lrmd_client.:488   )   trace:
lrmd_create_op:       Sending call options: 00000004, 4
Jan 17 13:54:54 [27552] zs95kj       lrmd: (      lrmd.c:1644  )   trace:
process_lrmd_message: Processing lrmd_rsc_exec operation from
a2a90d95-a766-42ba-84a0-54574c10adee
Jan 17 13:54:54 [27552] zs95kj       lrmd: (      lrmd.c:347   )   trace:
schedule_lrmd_cmd:    Scheduling migrate_to on zs95kjg110061_res
Jan 17 13:54:54 [27552] zs95kj       lrmd: (      lrmd.c:1684  )   debug:
process_lrmd_message: Processed lrmd_rsc_exec operation from
a2a90d95-a766-42ba-84a0-54574c10adee: rc=941, reply=1, notify=0,
exit=-2147471352
Jan 17 13:54:54 [27552] zs95kj       lrmd: (      lrmd.c:131   )    info:
log_execute:  executing - rsc:zs95kjg110061_res action:migrate_to
call_id:941
Jan 17 13:54:54 [27552] zs95kj       lrmd: (      lrmd.c:1184  )   trace:
lrmd_rsc_execute_service_lib: Creating action, resource:zs95kjg110061_res
action:migrate_to class:ocf provider:heartbeat agent:VirtualDomain
Jan 17 13:54:54 [27555] zs95kj       crmd: (lrmd_client.:846   )   trace:
lrmd_send_command:    lrmd_rsc_exec op reply received
Jan 17 13:54:54 [27555] zs95kj       crmd: (lrmd_client.:852   )   trace:
lrmd_send_command:    Reply   <lrmd_reply lrmd_origin="send_reply"
lrmd_rc="941" lrmd_callid="941"/>
Jan 17 13:54:54 [27555] zs95kj       crmd: (lrmd_client.:300   )   trace:
lrmd_dispatch_internal:       op lrmd_rsc_exec notify event received
Jan 17 13:54:54 [27555] zs95kj       crmd: (       lrm.c:2385  )    info:
process_lrm_event:    Operation zs95kjg110061_res_monitor_30000: Cancelled
(node=zs95kjpcs1, call=936, confirmed=true)
Jan 17 13:54:54 [27555] zs95kj       crmd: (       lrm.c:196   )   debug:
update_history_cache: Updating history for 'zs95kjg110061_res' with monitor
op
VirtualDomain(zs95kjg110061_res)[135045]:       2017/01/17_13:54:54 DEBUG:
Virtual domain zs95kjg110061 is currently running.
VirtualDomain(zs95kjg110061_res)[135045]:       2017/01/17_13:54:54 INFO:
zs95kjg110061: Starting live migration to zs90kppcs1 (using: virsh
--connect=qemu:///system --quiet migrate --live  zs95kjg110061
qemu+ssh://zs90kppcs1/system ).

Jan 17 13:54:54 [27550] zs95kj        cib: (       xml.c:1728  )    info:
cib_perform_op:       +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='zs95kjg110061_res']/lrm_rsc_op[@id='zs95kjg110061_res_last_0']:
@operation_key=zs95kjg110061_res_migrate_to_0, @operation=migrate_to,
@crm-debug-origin=cib_action_update,
@transition-key=337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af677dd,
@transition-magic=-1:193;337:574:0:43c68253-0e7d-4bb5-a78b-9b3c9af677dd,
@call-id=-1, @rc-code=193, @op-s


### 20 seconds after the live migration was initiated, we get the timeout:

Jan 17 13:55:14 [27552] zs95kj       lrmd: (  mainloop.c:949   ) warning:
child_timeout_callback:       zs95kjg110061_res_migrate_to_0 process (PID
135045) timed out
Jan 17 13:55:14 [27552] zs95kj       lrmd: (services_lin:314   ) warning:
operation_finished:   zs95kjg110061_res_migrate_to_0:135045 - timed out
after 20000ms
Jan 17 13:55:14 [27552] zs95kj       lrmd: (services_lin:332   )   debug:
operation_finished:   zs95kjg110061_res_migrate_to_0:135045:stderr [ --
empty -- ]
Jan 17 13:55:14 [27552] zs95kj       lrmd: (services_lin:336   )   debug:
operation_finished:   zs95kjg110061_res_migrate_to_0:135045:stdout [ --
empty -- ]
Jan 17 13:55:14 [27552] zs95kj       lrmd: (      lrmd.c:592   )   trace:
cmd_finalize: Resource operation rsc:zs95kjg110061_res action:migrate_to
completed (0xa5f1edd0 0xa5f1edd0)
Jan 17 13:55:14 [27552] zs95kj       lrmd: (      lrmd.c:113   )    info:
log_finished: finished - rsc:zs95kjg110061_res action:migrate_to
call_id:941 pid:135045 exit-code:1 exec-time:20003ms queue-time:0ms
Jan 17 13:55:14 [27552] zs95kj       lrmd: (      lrmd.c:1292  )   trace:
lrmd_rsc_execute:     Nothing further to do for zs95kjg110061_res
Jan 17 13:55:14 [27555] zs95kj       crmd: (     utils.c:1942  )   debug:
create_operation_update:      do_update_resource: Updating resource
zs95kjg110061_res after migrate_to op Timed Out (interval=0)
Jan 17 13:55:14 [27555] zs95kj       crmd: (       lrm.c:2397  )   error:
process_lrm_event:    Operation zs95kjg110061_res_migrate_to_0: Timed Out
(node=zs95kjpcs1, call=941, timeout=20000ms)
Jan 17 13:55:14 [27555] zs95kj       crmd: (       lrm.c:196   )   debug:
update_history_cache: Updating history for 'zs95kjg110061_res' with
migrate_to op


Any ideas?



Scott Greenlese ... IBM KVM on System z - Solution Test,  Poughkeepsie,
N.Y.
  INTERNET:  swgreenl at us.ibm.com




From:	Ken Gaillot <kgaillot at redhat.com>
To:	users at clusterlabs.org
Date:	01/17/2017 11:41 AM
Subject:	Re: [ClusterLabs] Live Guest Migration timeouts for
            VirtualDomain resources



On 01/17/2017 10:19 AM, Scott Greenlese wrote:
> Hi..
>
> I've been testing live guest migration (LGM) with VirtualDomain
> resources, which are guests running on Linux KVM / System Z
> managed by pacemaker.
>
> I'm looking for documentation that explains how to configure my
> VirtualDomain resources such that they will not timeout
> prematurely when there is a heavy I/O workload running on the guest.
>
> If I perform the LGM with an unmanaged guest (resource disabled), it
> takes anywhere from 2 - 5 minutes to complete the LGM.
> Example:
>
> # Migrate guest, specify a timeout value of 600s
>
> [root at zs95kj VD]# date;virsh --keepalive-interval 10 migrate --live
> --persistent --undefinesource --timeout 600 --verbose zs95kjg110061
> qemu+ssh://zs90kppcs1/system
> Mon Jan 16 16:35:32 EST 2017
>
> Migration: [100 %]
>
> [root at zs95kj VD]# date
> Mon Jan 16 16:40:01 EST 2017
> [root at zs95kj VD]#
>
> Start: 16:35:32
> End: 16:40:01
> Total: *4 min 29 sec*
>
>
> In comparison, when the guest is managed by pacemaker, and enabled for
> LGM ... I get this:
>
> [root at zs95kj VD]# date;pcs resource show zs95kjg110061_res
> Mon Jan 16 15:13:33 EST 2017
> Resource: zs95kjg110061_res (class=ocf provider=heartbeat
> type=VirtualDomain)
> Attributes: config=/guestxml/nfs1/zs95kjg110061.xml
> hypervisor=qemu:///system migration_transport=ssh
> Meta Attrs: allow-migrate=true remote-node=zs95kjg110061
> remote-addr=10.20.110.61
> Operations: start interval=0s timeout=480
> (zs95kjg110061_res-start-interval-0s)
> stop interval=0s timeout=120 (zs95kjg110061_res-stop-interval-0s)
> monitor interval=30s (zs95kjg110061_res-monitor-interval-30s)
> migrate-from interval=0s timeout=1200
> (zs95kjg110061_res-migrate-from-interval-0s)
> *migrate-to* interval=0s *timeout=1200*
> (zs95kjg110061_res-migrate-to-interval-0s)
>
> NOTE: I didn't specify any migrate-to value for timeout, so it defaulted
> to 1200. Is this seconds? If so, that's 20 minutes,
> ample time to complete a 5 minute migration.

Not sure where the default of 1200 comes from, but I believe the default
is milliseconds if no unit is specified. Normally you'd specify
something like "timeout=1200s".

> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:27:01 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
> [root at zs95kj VD]#
>
>
> [root at zs95kj VD]# date;*pcs resource move zs95kjg110061_res zs95kjpcs1*
> Mon Jan 16 14:45:39 EST 2017
> You have new mail in /var/spool/mail/root
>
>
> Jan 16 14:45:37 zs90kp VirtualDomain(zs95kjg110061_res)[21050]: INFO:
> zs95kjg110061: *Starting live migration to zs95kjpcs1 (using: virsh
> --connect=qemu:///system --quiet migrate --live zs95kjg110061
> qemu+ssh://zs95kjpcs1/system ).*
> Jan 16 14:45:57 zs90kp lrmd[12798]: warning:
> zs95kjg110061_res_migrate_to_0 process (PID 21050) timed out
> Jan 16 14:45:57 zs90kp lrmd[12798]: warning:
> zs95kjg110061_res_migrate_to_0:21050 - timed out after 20000ms
> Jan 16 14:45:57 zs90kp crmd[12801]: error: Operation
> zs95kjg110061_res_migrate_to_0: Timed Out (node=zs90kppcs1, call=1978,
> timeout=20000ms)
> Jan 16 14:45:58 zs90kp journal: operation failed: migration job:
> unexpectedly failed
> [root at zs90KP VD]#
>
> So, the migration timed out after 20000ms. Assuming ms is milliseconds,
> that's only 20 seconds. So, it seems that LGM timeout has
> nothing to do with *migrate-to* on the resource definition.

Yes, ms is milliseconds. Pacemaker internally represents all times in
milliseconds, even though in most actual usage, it has 1-second
granularity.

If your specified timeout is 1200ms, I'm not sure why it's using
20000ms. There may be a minimum enforced somewhere.

> Also, what is the expected behavior when the migration times out? I
> watched the VirtualDomain resource state during the migration process...
>
> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:45:57 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs90kppcs1
> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:02 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:06 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:08 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:10 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): FAILED zs90kppcs1
> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:12 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Stopped
> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:14 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
> [root at zs95kj VD]# date;pcs resource show |grep zs95kjg110061_res
> Mon Jan 16 14:46:17 EST 2017
> zs95kjg110061_res (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
> [root at zs95kj VD]#
>
>
> So, it seems as if the guest migration actually did succeed, at least
> the guest is running
> on the target node (KVM host). However... I checked the

Failure handling is configurable, but by default, if a live migration
fails, the cluster will do a full restart (= full stop then start). So
basically, it turns from a live migration to a cold migration.

> "blast" IO workload (writes to external, virtual storage accessible to
> all cluster
> hosts)
>
> I can experiment with different *migrate-to* timeout value settings, but
> would really
> prefer to have a good understanding of timeout configuration and
> recovery behavior first.
>
> Thanks!
>
>
> Scott Greenlese ... IBM KVM on System z - Solution Test, Poughkeepsie,
N.Y.
> INTERNET: swgreenl at us.ibm.com
