[ClusterLabs] Live Guest Migration timeouts for VirtualDomain resources

Wed Feb 1 10:31:56 EST 2017

On 02/01/2017 09:15 AM, Scott Greenlese wrote:
> Hi all...
> 
> Just a quick follow-up.
> 
> Thought I should come clean and share with you that the incorrect
> "migrate-to" operation name defined in my VirtualDomain
> resource was my mistake. It was mis-coded in the virtual guest
> provisioning script. I have since changed it to "migrate_to"
> and of course, the specified live migration timeout value is working
> effectively now. (For some reason, I assumed we were letting that
> operation meta value default).
> 
> I was wondering if someone could refer me to the definitive online link
> for pacemaker resource man pages? I don't see any resource man pages
> installed on my system anywhere. I found this one online:
> https://www.mankier.com/7/ocf_heartbeat_VirtualDomain but is there a
> more 'official' page I should refer our Linux KVM on System z
> customers to?

All distributions that I know of include the man pages with the packages
they distribute. Are you building from source? The pages are named after
the agent, so you can view one with, e.g., "man ocf_heartbeat_IPaddr2".

FYI after following this thread, the pcs developers are making a change
so that pcs refuses to add an unrecognized operation unless the user
uses --force. Thanks for being involved in the community; this is how we
learn to improve!
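
To illustrate the kind of check described above, here is a minimal Python
sketch. It is not the actual pcs implementation. The operation list is
copied from the "crm ra info VirtualDomain" output quoted further down in
this thread; the function name check_operation and its force parameter are
hypothetical, and the near-miss suggestion via difflib is an addition for
illustration only.

```python
import difflib

# Operation names taken from the "crm ra info VirtualDomain" output quoted
# later in this thread; a real tool would read them from the agent metadata.
KNOWN_OPS = ["start", "stop", "status", "monitor", "migrate_from", "migrate_to"]


def check_operation(name, known_ops=KNOWN_OPS, force=False):
    """Validate an operation name against the known list (hypothetical helper).

    Returns None for a recognized name. For an unrecognized name, raises
    ValueError unless force is True, in which case the warning text is
    returned instead.
    """
    if name in known_ops:
        return None
    # Suggest the closest known name, if any is reasonably similar.
    hint = difflib.get_close_matches(name, known_ops, n=1)
    message = "unrecognized operation '%s'" % name
    if hint:
        message += ", did you mean '%s'?" % hint[0]
    if not force:
        raise ValueError(message + " (use force to override)")
    return message


if __name__ == "__main__":
    print(check_operation("migrate_to"))              # recognized name
    print(check_operation("migrate-to", force=True))  # accepted, with warning
    try:
        check_operation("migrate-to")                 # refused without force
    except ValueError as err:
        print("refused:", err)
```

With a check of this shape, the "migrate-to" typo from this thread would
have been rejected at creation time instead of being silently stored as an
operation that pacemaker never runs.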

> Thanks again for your assistance.
> 
> Scott Greenlese ...IBM KVM on System Z Solution Test Poughkeepsie, N.Y.
> INTERNET: swgreenl at us.ibm.com
> 
> 
> 
> From: "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de>
> To: <users at clusterlabs.org>, Scott Greenlese/Poughkeepsie/IBM at IBMUS
> Cc: "Si Bo Niu" <niusibo at cn.ibm.com>, Michael Tebolt/Poughkeepsie/IBM at IBMUS
> Date: 01/27/2017 02:32 AM
> Subject: Antw: Re: [ClusterLabs] Antw: Re: Live Guest Migration timeouts
> for VirtualDomain resources
> 
> ------------------------------------------------------------------------
> 
> 
> 
>>>> "Scott Greenlese" <swgreenl at us.ibm.com> wrote on 27.01.2017 at 02:47 in
> message
> <OF63CD0E10.D58C4C3D-ON002580B5.0005C410-852580B5.0009DBDE at notes.na.collabserv.c
> 
> m>:
> 
>> Hi guys..
>>
>> Well, today I confirmed that what Ulrich said is correct. If I update the
>> VirtualDomain resource with the operation name "migrate_to" instead of
>> "migrate-to", the new timeout value takes effect and overrides the
>> 1200ms default.
>>
>> I am wondering how I would have known that I was using the wrong operation
>> name, given that the name was already incorrect when the resource
>> was created?
> 
> For SLES 11, I made a quick (portable non-portable unstable) try (print
> the operations known to an RA):
> # crm ra info VirtualDomain |sed -n -e "/Operations' defaults/,\$p"
> Operations' defaults (advisory minimum):
> 
>    start         timeout=90
>    stop          timeout=90
>    status        timeout=30 interval=10
>    monitor       timeout=30 interval=10
>    migrate_from  timeout=60
>    migrate_to    timeout=120
> 
> Regards,
> Ulrich
> 
>>
>> This is what the meta data for my resource looked like after making the
>> update:
>>
>> [root at zs95kj VD]# date;pcs resource update zs95kjg110065_res op migrate_to timeout="360s"
>> Thu Jan 26 16:43:11 EST 2017
>> You have new mail in /var/spool/mail/root
>>
>> [root at zs95kj VD]# date;pcs resource show zs95kjg110065_res
>> Thu Jan 26 16:43:46 EST 2017
>>  Resource: zs95kjg110065_res (class=ocf provider=heartbeat type=VirtualDomain)
>>   Attributes: config=/guestxml/nfs1/zs95kjg110065.xml hypervisor=qemu:///system migration_transport=ssh
>>   Meta Attrs: allow-migrate=true
>>   Operations: start interval=0s timeout=120 (zs95kjg110065_res-start-interval-0s)
>>               stop interval=0s timeout=120 (zs95kjg110065_res-stop-interval-0s)
>>               monitor interval=30s (zs95kjg110065_res-monitor-interval-30s)
>>               migrate-from interval=0s timeout=1200 (zs95kjg110065_res-migrate-from-interval-0s)
>>               migrate-to interval=0s timeout=1200 (zs95kjg110065_res-migrate-to-interval-0s)   <<< Original op name / value
>>               migrate_to interval=0s timeout=360s (zs95kjg110065_res-migrate_to-interval-0s)  <<< New op name / value
>>
>>
>> Where does that original op name come from in the VirtualDomain resource
>> definition? How can we get the initial meta value changed and shipped with
>> a valid operation name (i.e. migrate_to), and maybe a more reasonable
>> migrate_to timeout value... something significantly higher than 1200ms,
>> i.e. 1.2 seconds? Can I report this request as a bugzilla on the RHEL
>> side, or should this go to my internal IBM bugzilla for KVM on System Z
>> development?
>>
>> Anyway, thanks so much for identifying my issue.  I can reconfigure my
>> resources to make them tolerate longer migration execution times.
>>
>>
>> Scott Greenlese ... IBM KVM on System Z Solution Test
>>   INTERNET:  swgreenl at us.ibm.com
>>
>>
>>
>>
>> From: Ken Gaillot <kgaillot at redhat.com>
>> To: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>,
>>             users at clusterlabs.org
>> Date: 01/19/2017 10:26 AM
>> Subject: Re: [ClusterLabs] Antw: Re: Live Guest Migration timeouts for
>>             VirtualDomain resources
>>
>>
>>
>> On 01/19/2017 01:36 AM, Ulrich Windl wrote:
>>>>>> Ken Gaillot <kgaillot at redhat.com> wrote on 18.01.2017 at 16:32 in message
>>> <4b02d3fa-4693-473b-8bed-dc98f9e3f3f3 at redhat.com>:
>>>> On 01/17/2017 04:45 PM, Scott Greenlese wrote:
>>>>> Ken and Co,
>>>>>
>>>>> Thanks for the useful information.
>>>>>
>>>
>>> [...]
>>>>>
>>>>> Is this internally coded within the class=ocf provider=heartbeat
>>>>> type=VirtualDomain resource agent?
>>>>
>>>> Aha, I just realized what the issue is: the operation name is
>>>> migrate_to, not migrate-to.
>>>>
>>>> For technical reasons, pacemaker can't validate operation names (at the
>>>> time that the configuration is edited, it does not necessarily have
>>>> access to the agent metadata).
>>>
>>> BUT the set of operations is finite, right? So if those were in some XML
>>> schema, the names could be verified at least (not meaning that the
>>> operation is actually supported).
>>> BTW: Would a "crm configure verify" detect this kind of problem?
>>>
>>> [...]
>>>
>>> Ulrich
>>
>> Yes, it's in the resource agent meta-data. While pacemaker itself uses a
>> small set of well-defined actions, the agent may define any arbitrarily
>> named actions it desires, and the user could configure one of these as a
>> recurring action in pacemaker.
>>
>> Pacemaker itself has to be liberal about where its configuration comes
>> from -- the configuration can be edited on a separate machine, which
>> doesn't have resource agents, and then uploaded to the cluster. So
>> Pacemaker can't do that validation at configuration time. (It could
>> theoretically do some checking after the fact when the configuration is
>> loaded, but this could be a lot of overhead, and there are
>> implementation issues at the moment.)
>>
>> Higher-level tools like crmsh and pcs, on the other hand, can make
>> simplifying assumptions. They can require access to the resource agents
>> so that they can do extra validation.
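
As a rough illustration of the extra validation a higher-level tool could
do, the Python sketch below reads the action names from an agent's OCF
meta-data XML and lists the configured operations that the agent does not
advertise. The SAMPLE_METADATA string is an abbreviated stand-in that
mirrors the advisory timeouts quoted earlier in this thread, not the real
VirtualDomain output, and the helper names are hypothetical. On a cluster
node the XML would come from the agent's own meta-data action.

```python
import xml.etree.ElementTree as ET

# Abbreviated, illustrative metadata. This is NOT the real VirtualDomain
# meta-data; the values mirror the "crm ra info" listing in this thread.
SAMPLE_METADATA = """
<resource-agent name="VirtualDomain">
  <actions>
    <action name="start" timeout="90"/>
    <action name="stop" timeout="90"/>
    <action name="status" timeout="30" interval="10"/>
    <action name="monitor" timeout="30" interval="10"/>
    <action name="migrate_from" timeout="60"/>
    <action name="migrate_to" timeout="120"/>
  </actions>
</resource-agent>
"""


def advertised_actions(metadata_xml):
    """Map each advertised action name to its advisory timeout string.

    If an agent lists the same action more than once (for example several
    monitor variants), the last entry wins in this simple sketch.
    """
    root = ET.fromstring(metadata_xml.strip())
    return {
        action.get("name"): action.get("timeout")
        for action in root.findall("./actions/action")
    }


def unadvertised(configured_ops, metadata_xml):
    """Return the configured operation names the agent does not advertise."""
    known = advertised_actions(metadata_xml)
    return [op for op in configured_ops if op not in known]


if __name__ == "__main__":
    # Operation names as they appeared in the "pcs resource show" output.
    configured = ["start", "stop", "monitor",
                  "migrate-from", "migrate-to", "migrate_to"]
    print(advertised_actions(SAMPLE_METADATA))
    print(unadvertised(configured, SAMPLE_METADATA))
```

Because an agent may define arbitrarily named actions, a tool built this
way can only report names the agent does not advertise; it cannot know
whether an unadvertised name is a typo or simply unsupported.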