[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Tue Jun 7 05:12:04 UTC 2016

07.06.2016 02:20, Ken Gaillot wrote:
> On 06/06/2016 03:30 PM, Vladislav Bogdanov wrote:
>> 06.06.2016 22:43, Ken Gaillot wrote:
>>> On 06/06/2016 12:25 PM, Vladislav Bogdanov wrote:
>>>> 06.06.2016 19:39, Ken Gaillot wrote:
>>>>> On 06/05/2016 07:27 PM, Andrew Beekhof wrote:
>>>>>> On Sat, Jun 4, 2016 at 12:16 AM, Ken Gaillot <kgaillot at redhat.com>
>>>>>> wrote:
>>>>>>> On 06/02/2016 08:01 PM, Andrew Beekhof wrote:
>>>>>>>> On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot <kgaillot at redhat.com>
>>>>>>>> wrote:
>>>>>>>>> A recent thread discussed a proposed new feature, a new environment
>>>>>>>>> variable that would be passed to resource agents, indicating whether a
>>>>>>>>> stop action was part of a recovery.
>>>>>>>>>
>>>>>>>>> Since that thread was long and covered a lot of topics, I'm starting a
>>>>>>>>> new one to focus on the core issue remaining:
>>>>>>>>>
>>>>>>>>> The original idea was to pass the number of restarts remaining before
>>>>>>>>> the cluster will no longer try to start the resource on the same node.
>>>>>>>>> This involves calculating (migration-threshold - fail-count), and that
>>>>>>>>> implies certain limitations: (1) it will only be set when the cluster
>>>>>>>>> checks migration-threshold; (2) it will only be set for the failed
>>>>>>>>> resource itself, not for other resources that may be recovered due to
>>>>>>>>> dependencies on it.
>>>>>>>>>
>>>>>>>>> Ulrich Windl proposed an alternative: setting a boolean value instead.
>>>>>>>>> I forgot to cc the list on my reply, so I'll summarize now: We would
>>>>>>>>> set a new variable like OCF_RESKEY_CRM_recovery=true
>>>>>>>>
>>>>>>>> This concept worries me, especially when what we've implemented is
>>>>>>>> called OCF_RESKEY_CRM_restarting.
>>>>>>>
>>>>>>> Agreed; I plan to rename it yet again, to
>>>>>>> OCF_RESKEY_CRM_start_expected.
>>>>>>>
>>>>>>>> The name alone encourages people to "optimise" the agent to not
>>>>>>>> actually stop the service "because it's just going to start again
>>>>>>>> shortly".  I know that's not what Adam would do, but not everyone
>>>>>>>> understands how clusters work.
>>>>>>>>
>>>>>>>> There are any number of reasons why a cluster that intends to restart
>>>>>>>> a service may not do so.  In such a scenario, a badly written agent
>>>>>>>> would cause the cluster to mistakenly believe that the service is
>>>>>>>> stopped - allowing it to start elsewhere.
>>>>>>>>
>>>>>>>> It's true there are any number of ways to write bad agents, but I would
>>>>>>>> argue that we shouldn't be nudging people in that direction :)
>>>>>>>
>>>>>>> I do have mixed feelings about that. I think if we name it
>>>>>>> start_expected, and document it carefully, we can avoid any casual
>>>>>>> mistakes.
>>>>>>>
>>>>>>> My main question is how useful it would actually be in the proposed
>>>>>>> use cases. Considering the possibility that the expected start might
>>>>>>> never happen (or fail), can an RA really do anything different if
>>>>>>> start_expected=true?
>>>>>>
>>>>>> I would have thought not.  Correctness should trump optimal.
>>>>>> But I'm prepared to be mistaken.
>>>>>>
>>>>>>> If the use case is there, I have no problem with
>>>>>>> adding it, but I want to make sure it's worthwhile.
>>>>>
>>>>> Anyone have comments on this?
>>>>>
>>>>> A simple example: pacemaker calls an RA stop with start_expected=true,
>>>>> then before the start happens, someone disables the resource, so the
>>>>> start is never called. Or the node is fenced before the start happens,
>>>>> etc.
>>>>>
>>>>> Is there anything significant an RA can do differently based on
>>>>> start_expected=true/false without causing problems if an expected start
>>>>> never happens?
>>>>
>>>> Yep.
>>>>
>>>> It may request a stop of other resources
>>>> * on that node, by removing some node attributes which participate in
>>>> location constraints
>>>> * or cluster-wide, by revoking (or putting into standby) a cluster ticket
>>>> that other resources depend on
>>>>
>>>> The latter case is why I asked about the possibility of passing the name
>>>> of the node the resource is intended to be started on instead of a boolean
>>>> value (in comments to PR #1026) - I would use it to request a stop of
>>>> lustre MDTs and OSTs by revoking the ticket they depend on if MGS (the
>>>> primary lustre component, which does all "request routing") fails to start
>>>> anywhere in the cluster. That way, if the RA does not receive any node name,
>>>
>>> Why would ordering constraints be insufficient?
>>
>> They are in place, but only as advisory ones, to allow MGS fail/switch-over.
>>>
>>> What happens if the MDTs/OSTs continue running because a start of MGS
>>> was expected, but something prevents the start from actually happening?
>>
>> Nothing critical; lustre clients won't be able to contact them without
>> MGS running and will hang.
>> But it is safer to shut them down if it is known that MGS cannot be
>> started right now. Especially if geo-cluster failover is expected in
>> that case (as MGS can be local to a site, contrary to all other lustre
>> parts, which need to be replicated). Actually that is the only part of the
>> puzzle remaining to "solve" that big project, and IMHO it is enough to
>> have the node name of an intended start, or nothing, in that attribute
>> (nothing means stop everything and initiate geo-failover if needed). If,
>> e.g., fencing happens for a node intended to start the resource, then stop
>> will be called again after the next start failure once failure-timeout
>> lapses. That would be much better than no information at all. Total stop
>> or geo-failover will happen just with some (configurable) delay instead
>> of rendering the whole filesystem unusable and requiring manual
>> intervention.
>
> My gut feeling is that this is getting RAs a little too involved in the
> cluster's inner workings. If I understand your idea correctly, it would

;)

> be sufficient for your needs to know whether a start is expected on any
> node in the same transition. So maybe start_expected=no/local/peer would
> cover this use case and the original one.

Yes, that is perfectly good for me.
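
To make the no/local/peer idea concrete, here is a minimal sketch of how
an RA's stop action might consume such a value. It assumes the variable
ends up being named OCF_RESKEY_CRM_start_expected and takes the values
no/local/peer; neither is implemented, and the action names
"stop_service" and "revoke_ticket" are purely hypothetical placeholders.
Note that the service is always stopped for real; the hint only decides
whether the ticket the dependent resources rely on gets revoked.

```python
import os

# Hypothetical values, taken from the no/local/peer proposal in this thread.
START_EXPECTED_VALUES = ("no", "local", "peer")


def read_start_expected(env=None):
    """Return the proposed start_expected hint, defaulting to "no"."""
    env = os.environ if env is None else env
    value = env.get("OCF_RESKEY_CRM_start_expected", "no").strip().lower()
    # Treat anything unrecognized as "no", the conservative choice.
    return value if value in START_EXPECTED_VALUES else "no"


def plan_stop(env=None):
    """List what a stop action would do, in order.

    The service itself is always stopped fully. The hint only controls
    whether the ticket that dependent resources rely on is revoked.
    """
    hint = read_start_expected(env)
    actions = ["stop_service"]  # hypothetical action name
    if hint == "no":
        # No start expected anywhere: ask dependents (e.g. MDTs/OSTs) to stop.
        actions.append("revoke_ticket")  # hypothetical action name
    return actions
```

If the expected start never happens (resource disabled, node fenced), the
next stop would arrive without the hint and fall into the "no" branch, so
the dependents are still stopped, just later.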

>
>>>
>>>> then it can be "almost sure" that pacemaker does not intend to restart
>>>> the resource (yet) and can request it to stop everything else (because
>>>> the filesystem is not usable anyway). Later, if another start attempt
>>>> (caused by failure-timeout expiration) succeeds, the RA may grant the
>>>> ticket back, and all other resources start again.
>>>>
>>>> Best,
>>>> Vladislav