[ClusterLabs] Antw: Re: FR: send failcount to OCF RA start/stop actions

Tue May 10 10:17:22 EDT 2016

On 05/10/2016 02:29 AM, Ulrich Windl wrote:
>>>> Ken Gaillot <kgaillot at redhat.com> wrote on 10.05.2016 at 00:40 in message
> <573111D3.7060102 at redhat.com>:
>> On 05/04/2016 11:47 AM, Adam Spiers wrote:
>>> Ken Gaillot <kgaillot at redhat.com> wrote:
>>>> On 05/04/2016 08:49 AM, Klaus Wenninger wrote:
>>>>> On 05/04/2016 02:09 PM, Adam Spiers wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> As discussed with Ken and Andrew at the OpenStack summit last week, we
>>>>>> would like Pacemaker to be extended to export the current failcount as
>>>>>> an environment variable to OCF RA scripts when they are invoked with
>>>>>> 'start' or 'stop' actions.  This would mean that if you have
>>>>>> start-failure-is-fatal=false and migration-threshold=3 (say), then you
>>>>>> would be able to implement a different behaviour for the third and
>>>>>> final 'stop' of a service executed on a node, which is different to
>>>>>> the previous 'stop' actions executed just prior to attempting a
>>>>>> restart of the service.  (In the non-clone case, this would happen
>>>>>> just before migrating the service to another node.)
>>>>> So what you actually want to know is how much headroom
>>>>> there still is till the resource would be migrated.
>>>>> So wouldn't it then be much more catchy if we don't pass
>>>>> the failcount but rather the headroom?
>>>>
>>>> Yes, that's the plan: pass a new environment variable with
>>>> (migration-threshold - fail-count) when recovering a resource. I haven't
>>>> worked out the exact behavior yet, but that's the idea. I do hope to get
>>>> this in 1.1.15 since it's a small change.
>>>>
>>>> The advantage over using crm_failcount is that it will be limited to the
>>>> current recovery attempt, and it will calculate the headroom as you say,
>>>> rather than the raw failcount.
>>>
>>> Headroom sounds more usable, but if it's not significant extra work,
>>> why not pass both?  It could come in handy, even if only for more
>>> informative logging from the RA.
>>>
>>> Thanks a lot!
>>
>> Here is what I'm testing currently:
>>
>> - When the cluster recovers a resource, the resource agent's stop action
>> will get a new variable, OCF_RESKEY_CRM_meta_recovery_left =
>> migration-threshold - fail-count on the local node.
> 
> With that mechanism, RA testing will be more complicated than it is now, and I cannot see the benefit yet.

Testing will be more complicated for RAs that choose to behave
differently depending on the variable's value, but the vast majority
won't, so it will have no effect on most users. Pacemaker's own behavior
does not change.

BTW I should have explicitly mentioned that the variable name is up for
discussion; I had a hard time coming up with something meaningful that
didn't span an entire line of text.

>>
>> - The variable is not added for any action other than stop.
>>
>> - I'm preferring simplicity over flexibility by providing only a single
>> variable. The RA theoretically can already get the migration-threshold
>> from the CIB and fail-count from attrd -- what we're adding is the
>> knowledge that the stop is part of a recovery.
>>
>> - If the stop is final (the cluster does not plan to start the resource
>> anywhere), the variable may be set to 0, or unset. The RA should treat 0
>> and unset as equivalent.
>>
>> - So, the variable will be 1 for the stop before the last time the
>> cluster will try to start the resource on the same node, and 0 or unset
>> for the last stop on this node before trying to start on another node.
> 
> Be aware that the node could be fenced (for reasons outside of your RA) even before all these attempts are carried out.

Yes, by listing such scenarios and the ones below, I am hoping the
potential users of this feature can think through whether it will be
sufficient for their use cases.
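
To make the arithmetic in the quoted bullets concrete, here is a rough
Python sketch of the value an RA would see on successive recovery stops.
The function name recovery_left is hypothetical and only mirrors the
formula (migration-threshold - fail-count); it is not Pacemaker code, and
clamping at 0 is an assumption of the sketch rather than stated behavior.

```python
def recovery_left(migration_threshold: int, fail_count: int) -> int:
    """Headroom as proposed: migration-threshold minus fail-count.

    Clamping at 0 is an assumption of this sketch, not documented behavior.
    """
    return max(migration_threshold - fail_count, 0)


# With migration-threshold=3: the stops that follow the 1st, 2nd and 3rd
# failure on the local node.
values = [recovery_left(3, n) for n in (1, 2, 3)]
print(values)
```

With migration-threshold=3 this gives 2, 1, 0: the value is 1 on the stop
before the last local start attempt, and 0 (or, in the real proposal,
possibly unset) on the last stop before the resource moves away.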

>>
>> - The variable will be set only in situations when the cluster will
>> consider migration-threshold. This makes sense, but some situations may
>> be unintuitive:
>>
>> -- If a resource is being recovered, but the fail-count is being cleared
>> in the same transition, the cluster will ignore migration-threshold (and
>> the variable will not be set). The RA might see recovery_left=5, 4, 3,
>> then someone clears the fail-count, and it won't see recovery_left even
>> though there is a stop and start being attempted.
>>
>> -- Migration-threshold will be considered (and the variable will be set)
>> only if the resource is being recovered due to failure, not if the
>> resource is being restarted or moved for some other reason (constraints,
>> node standby, etc.).
>>
>> -- The previous point is true even if the resource is restarting/moving
>> because it is part of a group with another member being recovered due to
>> failure. Only the failed resource will get the variable. I can see this
>> might be problematic for interested RAs, because the resource may be
>> restarted several times on the local node then forced away, without the
>> variable ever being present -- but the resource will be forced away
>> because it is part of a group that is moving, not because it is being
>> recovered (its own fail-count stays 0).
>>
>> Let me know if you see any problems or have any suggestions.
> 
> Can you summarize in one sentence what problem your proposal will solve?

While it may be useful to others in the future, the one use case it is
intended to address at the moment is:

The resource agent for OpenStack compute nodes can disable nova on the
local node if the cluster will not try to restart the agent there.

More generally, I suppose the point is to better support services that
can do a lesser tear-down for a stop-start cycle than a full stop. The
distinction between the two cases may not be 100% clear (as with your
fencing example), but the idea is that it would be used for
optimization, not some required behavior.

I am not sure the current implementation described above is sufficient,
but it should be a good starting point to work from.
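
As a rough illustration of how an interested RA might use this, here is a
minimal Python sketch of a stop action that picks between a lighter stop
and a full tear-down. The variable name is the one proposed above and is
still up for discussion; the function names and the returned strings are
hypothetical, and treating a non-numeric value like an unset one is an
assumption of the sketch.

```python
import os

# Name taken from the proposal in this thread; it may still change.
RECOVERY_VAR = "OCF_RESKEY_CRM_meta_recovery_left"


def restart_expected_here(env=None) -> bool:
    """Return True if the cluster is expected to start the resource here again.

    Unset and 0 are treated as equivalent, as the proposal asks. Empty or
    non-numeric values are also treated that way (assumption of this sketch).
    """
    env = os.environ if env is None else env
    raw = env.get(RECOVERY_VAR, "")
    try:
        return int(raw) > 0
    except ValueError:
        return False


def stop(env=None) -> str:
    """Hypothetical stop action: choose a lighter or a fuller tear-down."""
    if restart_expected_here(env):
        return "light-stop"      # e.g. stop the service, leave nova enabled
    return "full-teardown"       # e.g. also disable nova on this node
```

Because the variable is also absent for stops that are not part of a
recovery (constraints, standby, a moving group), this sketch takes the
full tear-down path in those cases too. That matches the intent of using
the variable only as an optimization, not for required behavior.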