[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Mon Jun 6 23:20:39 UTC 2016

On 06/06/2016 03:30 PM, Vladislav Bogdanov wrote:
> 06.06.2016 22:43, Ken Gaillot wrote:
>> On 06/06/2016 12:25 PM, Vladislav Bogdanov wrote:
>>> 06.06.2016 19:39, Ken Gaillot wrote:
>>>> On 06/05/2016 07:27 PM, Andrew Beekhof wrote:
>>>>> On Sat, Jun 4, 2016 at 12:16 AM, Ken Gaillot <kgaillot at redhat.com>
>>>>> wrote:
>>>>>> On 06/02/2016 08:01 PM, Andrew Beekhof wrote:
>>>>>>> On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot <kgaillot at redhat.com>
>>>>>>> wrote:
>>>>>>>> A recent thread discussed a proposed new feature, a new environment
>>>>>>>> variable that would be passed to resource agents, indicating
>>>>>>>> whether a
>>>>>>>> stop action was part of a recovery.
>>>>>>>>
>>>>>>>> Since that thread was long and covered a lot of topics, I'm
>>>>>>>> starting a
>>>>>>>> new one to focus on the core issue remaining:
>>>>>>>>
>>>>>>>> The original idea was to pass the number of restarts remaining
>>>>>>>> before
>>>>>>>> the resource will no longer be started on the same node.
>>>>>>>> This
>>>>>>>> involves calculating (migration-threshold - fail-count), and that
>>>>>>>> implies certain limitations: (1) it will only be set when the
>>>>>>>> cluster
>>>>>>>> checks migration-threshold; (2) it will only be set for the failed
>>>>>>>> resource itself, not for other resources that may be recovered
>>>>>>>> due to
>>>>>>>> dependencies on it.
>>>>>>>>
>>>>>>>> Ulrich Windl proposed an alternative: setting a boolean value
>>>>>>>> instead. I
>>>>>>>> forgot to cc the list on my reply, so I'll summarize now: We would
>>>>>>>> set a
>>>>>>>> new variable like OCF_RESKEY_CRM_recovery=true
>>>>>>>
>>>>>>> This concept worries me, especially when what we've implemented is
>>>>>>> called OCF_RESKEY_CRM_restarting.
>>>>>>
>>>>>> Agreed; I plan to rename it yet again, to
>>>>>> OCF_RESKEY_CRM_start_expected.
>>>>>>
>>>>>>> The name alone encourages people to "optimise" the agent to not
>>>>>>> actually stop the service "because it's just going to start again
>>>>>>> shortly".  I know that's not what Adam would do, but not everyone
>>>>>>> understands how clusters work.
>>>>>>>
>>>>>>> There are any number of reasons why a cluster that intends to
>>>>>>> restart
>>>>>>> a service may not do so.  In such a scenario, a badly written agent
>>>>>>> would cause the cluster to mistakenly believe that the service is
>>>>>>> stopped - allowing it to start elsewhere.
>>>>>>>
>>>>>>> It's true there are any number of ways to write bad agents, but I
>>>>>>> would
>>>>>>> argue that we shouldn't be nudging people in that direction :)
>>>>>>
>>>>>> I do have mixed feelings about that. I think if we name it
>>>>>> start_expected, and document it carefully, we can avoid any casual
>>>>>> mistakes.
>>>>>>
>>>>>> My main question is how useful it would actually be in the
>>>>>> proposed use
>>>>>> cases. Considering the possibility that the expected start might
>>>>>> never
>>>>>> happen (or fail), can an RA really do anything different if
>>>>>> start_expected=true?
>>>>>
>>>>> I would have thought not.  Correctness should trump optimal.
>>>>> But I'm prepared to be mistaken.
>>>>>
>>>>>> If the use case is there, I have no problem with
>>>>>> adding it, but I want to make sure it's worthwhile.
>>>>
>>>> Anyone have comments on this?
>>>>
>>>> A simple example: pacemaker calls an RA stop with start_expected=true,
>>>> then before the start happens, someone disables the resource, so the
>>>> start is never called. Or the node is fenced before the start happens,
>>>> etc.
>>>>
>>>> Is there anything significant an RA can do differently based on
>>>> start_expected=true/false without causing problems if an expected start
>>>> never happens?
>>>
>>> Yep.
>>>
>>> It may request a stop of other resources
>>> * on that node, by removing node attributes that participate in
>>> location constraints
>>> * or cluster-wide, by revoking or putting into standby a cluster
>>> ticket that other resources depend on
>>>
>>> The latter case is why I asked about the possibility of passing the
>>> name of the node the resource is intended to be started on instead of
>>> a boolean value (in comments to PR #1026) - I would use it to request
>>> a stop of the lustre MDTs and OSTs, by revoking the ticket they depend
>>> on, if the MGS (the primary lustre component, which does all "request
>>> routing") fails to start anywhere in the cluster. That way, if the RA
>>> does not receive any node name,
>>
>> Why would ordering constraints be insufficient?
> 
> They are in place, but only as advisory ones, to allow MGS fail/switch-over.
>>
>> What happens if the MDTs/OSTs continue running because a start of MGS
>> was expected, but something prevents the start from actually happening?
> 
> Nothing critical; lustre clients won't be able to contact them without
> the MGS running and will hang.
> But it is safer to shut them down if it is known that the MGS cannot be
> started right now, especially if geo-cluster failover is expected in
> that case (as the MGS can be local to a site, contrary to all other
> lustre parts, which need to be replicated). Actually, that is the only
> piece of the puzzle remaining to "solve" that big project, and IMHO it
> is enough to have either the name of the node of an intended start or
> nothing in that attribute (nothing means stop everything and initiate
> geo-failover if needed). If, e.g., the node intended to start the
> resource is fenced, then stop will be called again after the next start
> failure, once failure-timeout lapses. That would be much better than no
> information at all. A total stop or geo-failover will just happen with
> some (configurable) delay, instead of leaving the whole filesystem in
> an unusable state requiring manual intervention.

My gut feeling is that this is getting RAs a little too involved in the
cluster's inner workings. If I understand your idea correctly, it would
be sufficient for your needs to know whether a start is expected on any
node in the same transition. So maybe start_expected=no/local/peer would
cover this use case and the original one.
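
To make the no/local/peer idea concrete, here is a rough Python sketch of how
an agent might consume such a value. It is only an illustration under
assumptions: OCF_RESKEY_CRM_start_expected is the name proposed in this
thread and its final form is not settled, and the step names "stop_service"
and "release_dependents" are hypothetical placeholders, not real agent
functions. The point it tries to show is that the stop itself is never
skipped; the hint only decides whether dependents are also asked to stop.

```python
import os

# Values proposed in this thread; anything else is treated as "no".
VALID_HINTS = {"no", "local", "peer"}


def start_expected(env=None):
    """Read the proposed hint from the environment (assumed variable name)."""
    env = os.environ if env is None else env
    value = env.get("OCF_RESKEY_CRM_start_expected", "no").strip().lower()
    return value if value in VALID_HINTS else "no"


def plan_stop(env=None):
    """Return the ordered steps a stop action would take.

    The service is always stopped, whatever the hint says, because the
    expected start may never happen (resource disabled, node fenced, ...).
    """
    steps = ["stop_service"]
    if start_expected(env) == "no":
        # Hypothetical extra step, e.g. revoking a ticket that other
        # resources depend on.
        steps.append("release_dependents")
    return steps
```

With this shape, a missing or unrecognized value falls back to the
conservative "no" behavior, so an agent would never keep dependents running
on the strength of a hint it does not understand.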

>>
>>> then it can be "almost sure" pacemaker does not intend to restart the
>>> resource (yet) and can request that it stop everything else (because
>>> the filesystem is not usable anyway). Later, if another start attempt
>>> (caused by failure-timeout expiration) succeeds, the RA may grant the
>>> ticket back, and all other resources start again.
>>>
>>> Best,
>>> Vladislav