[ClusterLabs] Antw: Re: Antw: Re: Live Guest Migration timeouts for VirtualDomain resources

Fri Jan 27 02:32:34 EST 2017

>>> "Scott Greenlese" <swgreenl at us.ibm.com> wrote on 27.01.2017 at 02:47 in
message
<OF63CD0E10.D58C4C3D-ON002580B5.0005C410-852580B5.0009DBDE at notes.na.collabserv.c 
m>:

> Hi guys..
> 
> Well, today I confirmed that what Ulrich said is correct.  If I update the
> VirtualDomain resource with the operation name "migrate_to" instead of
> "migrate-to", it effectively overrides the 1200ms default value and
> enforces the new value.
> 
> I am wondering how I could have known that I was using the wrong operation
> name, given that the initial operation name is already incorrect
> when the resource is created?

On SLES 11, I made a quick (non-portable, unstable) attempt to print the operations known to an RA:
 # crm ra info VirtualDomain |sed -n -e "/Operations' defaults/,\$p"
Operations' defaults (advisory minimum):

    start         timeout=90
    stop          timeout=90
    status        timeout=30 interval=10
    monitor       timeout=30 interval=10
    migrate_from  timeout=60
    migrate_to    timeout=120
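
The listing above can also be compared mechanically against the operation name you intend to configure. What follows is only a minimal sketch, assuming the "crm ra info" output keeps the layout shown above; the function names known_ops and check_op and the file path in the usage comment are made up for illustration and are not part of crmsh or pcs.

```shell
#!/bin/sh
# known_ops: read "crm ra info <agent>" output on stdin and print the
# operation names found under the "Operations' defaults" heading.
# Assumption: each operation line looks like "<name> key=value ...".
known_ops() {
    sed -n -e "/Operations' defaults/,\$p" |
        awk 'NF >= 2 && $2 ~ /=/ { print $1 }'
}

# check_op: succeed only if operation name $1 is listed in the saved
# "crm ra info" output stored in file $2 (exact, literal match).
check_op() {
    known_ops < "$2" | grep -qxF -- "$1"
}

# Hypothetical usage on a node where crmsh is installed:
#   crm ra info VirtualDomain > /tmp/vd.info
#   check_op migrate-to /tmp/vd.info || echo "unknown operation name: migrate-to"
```

With the listing above, check_op would accept "migrate_to" and reject "migrate-to", because the match is on the exact name printed by the agent.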

Regards,
Ulrich

> 
> This is what the meta data for my resource looked like after making the
> update:
> 
> [root at zs95kj VD]# date;pcs resource update zs95kjg110065_res op migrate_to
> timeout="360s"
> Thu Jan 26 16:43:11 EST 2017
> You have new mail in /var/spool/mail/root
> 
> [root at zs95kj VD]# date;pcs resource show zs95kjg110065_res
> Thu Jan 26 16:43:46 EST 2017
>  Resource: zs95kjg110065_res (class=ocf provider=heartbeat
> type=VirtualDomain)
>   Attributes: config=/guestxml/nfs1/zs95kjg110065.xml
> hypervisor=qemu:///system migration_transport=ssh
>   Meta Attrs: allow-migrate=true
>   Operations: start interval=0s timeout=120
> (zs95kjg110065_res-start-interval-0s)
>               stop interval=0s timeout=120
> (zs95kjg110065_res-stop-interval-0s)
>               monitor interval=30s (zs95kjg110065_res-monitor-interval-30s)
>               migrate-from interval=0s timeout=1200
> (zs95kjg110065_res-migrate-from-interval-0s)
>               migrate-to interval=0s timeout=1200
> (zs95kjg110065_res-migrate-to-interval-0s)   <<< Original op name / value
>               migrate_to interval=0s timeout=360s
> (zs95kjg110065_res-migrate_to-interval-0s)  <<< New op name / value
> 
> 
> Where does that original op name come from in the VirtualDomain resource
> definition?  How can we get the initial meta value changed and shipped with
> a valid operation name (i.e. migrate_to), and
> maybe a more reasonable migrate_to timeout value... something significantly
> higher than 1200ms, i.e. 1.2 seconds?  Can I report this request as a
> bugzilla on the RHEL side, or should this go to my internal IBM bugzilla
> for KVM on System Z development?
> 
> Anyway, thanks so much for identifying my issue.  I can reconfigure my
> resources to make them tolerate longer migration execution times.
> 
> 
> Scott Greenlese ... IBM KVM on System Z Solution Test
>   INTERNET:  swgreenl at us.ibm.com 
> 
> 
> 
> 
> From:	Ken Gaillot <kgaillot at redhat.com>
> To:	Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>,
>             users at clusterlabs.org 
> Date:	01/19/2017 10:26 AM
> Subject:	Re: [ClusterLabs] Antw: Re: Live Guest Migration timeouts for
>             VirtualDomain resources
> 
> 
> 
> On 01/19/2017 01:36 AM, Ulrich Windl wrote:
>>>>> Ken Gaillot <kgaillot at redhat.com> wrote on 18.01.2017 at 16:32 in
> message
>> <4b02d3fa-4693-473b-8bed-dc98f9e3f3f3 at redhat.com>:
>>> On 01/17/2017 04:45 PM, Scott Greenlese wrote:
>>>> Ken and Co,
>>>>
>>>> Thanks for the useful information.
>>>>
>>
>> [...]
>>>>
>>>> Is this internally coded within the class=ocf provider=heartbeat
>>>> type=VirtualDomain resource agent?
>>>
>>> Aha, I just realized what the issue is: the operation name is
>>> migrate_to, not migrate-to.
>>>
>>> For technical reasons, pacemaker can't validate operation names (at the
>>> time that the configuration is edited, it does not necessarily have
>>> access to the agent metadata).
>>
>> BUT the set of operations is finite, right? So if those were in some XML
>> schema, the names could at least be verified (which would not mean that the
>> operation is actually supported).
>> BTW: Would a "crm configure verify" detect this kind of problem?
>>
>> [...]
>>
>> Ulrich
> 
> Yes, it's in the resource agent meta-data. While pacemaker itself uses a
> small set of well-defined actions, the agent may define any arbitrarily
> named actions it desires, and the user could configure one of these as a
> recurring action in pacemaker.
> 
> Pacemaker itself has to be liberal about where its configuration comes
> from -- the configuration can be edited on a separate machine, which
> doesn't have resource agents, and then uploaded to the cluster. So
> Pacemaker can't do that validation at configuration time. (It could
> theoretically do some checking after the fact when the configuration is
> loaded, but this could be a lot of overhead, and there are
> implementation issues at the moment.)
> 
> Higher-level tools like crmsh and pcs, on the other hand, can make
> simplifying assumptions. They can require access to the resource agents
> so that they can do extra validation.
> 