[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Mon Jun 6 21:43:10 CEST 2016

On 06/06/2016 12:25 PM, Vladislav Bogdanov wrote:
> 06.06.2016 19:39, Ken Gaillot wrote:
>> On 06/05/2016 07:27 PM, Andrew Beekhof wrote:
>>> On Sat, Jun 4, 2016 at 12:16 AM, Ken Gaillot <kgaillot at redhat.com>
>>> wrote:
>>>> On 06/02/2016 08:01 PM, Andrew Beekhof wrote:
>>>>> On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot <kgaillot at redhat.com>
>>>>> wrote:
>>>>>> A recent thread discussed a proposed new feature, a new environment
>>>>>> variable that would be passed to resource agents, indicating whether a
>>>>>> stop action was part of a recovery.
>>>>>>
>>>>>> Since that thread was long and covered a lot of topics, I'm starting a
>>>>>> new one to focus on the core issue remaining:
>>>>>>
>>>>>> The original idea was to pass the number of restarts remaining before
>>>>>> the resource will no longer be tried on the same node. This involves
>>>>>> calculating (migration-threshold - fail-count), and that implies
>>>>>> certain limitations: (1) it will only be set when the cluster checks
>>>>>> migration-threshold; (2) it will only be set for the failed resource
>>>>>> itself, not for other resources that may be recovered due to
>>>>>> dependencies on it.
>>>>>>
>>>>>> Ulrich Windl proposed an alternative: setting a boolean value instead.
>>>>>> I forgot to cc the list on my reply, so I'll summarize now: We would
>>>>>> set a new variable like OCF_RESKEY_CRM_recovery=true
>>>>>
>>>>> This concept worries me, especially when what we've implemented is
>>>>> called OCF_RESKEY_CRM_restarting.
>>>>
>>>> Agreed; I plan to rename it yet again, to
>>>> OCF_RESKEY_CRM_start_expected.
>>>>
>>>>> The name alone encourages people to "optimise" the agent to not
>>>>> actually stop the service "because it's just going to start again
>>>>> shortly".  I know that's not what Adam would do, but not everyone
>>>>> understands how clusters work.
>>>>>
>>>>> There are any number of reasons why a cluster that intends to restart
>>>>> a service may not do so.  In such a scenario, a badly written agent
>>>>> would cause the cluster to mistakenly believe that the service is
>>>>> stopped - allowing it to start elsewhere.
>>>>>
>>>>> It's true there are any number of ways to write bad agents, but I would
>>>>> argue that we shouldn't be nudging people in that direction :)
>>>>
>>>> I do have mixed feelings about that. I think if we name it
>>>> start_expected, and document it carefully, we can avoid any casual
>>>> mistakes.
>>>>
>>>> My main question is how useful it would actually be in the proposed use
>>>> cases. Considering the possibility that the expected start might never
>>>> happen (or fail), can an RA really do anything different if
>>>> start_expected=true?
>>>
>>> I would have thought not.  Correctness should trump optimal.
>>> But I'm prepared to be mistaken.
>>>
>>>> If the use case is there, I have no problem with
>>>> adding it, but I want to make sure it's worthwhile.
>>
>> Anyone have comments on this?
>>
>> A simple example: pacemaker calls an RA stop with start_expected=true,
>> then before the start happens, someone disables the resource, so the
>> start is never called. Or the node is fenced before the start happens,
>> etc.
>>
>> Is there anything significant an RA can do differently based on
>> start_expected=true/false without causing problems if an expected start
>> never happens?
> 
> Yep.
> 
> It may request a stop of other resources
> * on that node, by removing some node attributes which participate in
> location constraints
> * or cluster-wide, by revoking (or putting into standby) a cluster
> ticket that other resources depend on
> 
> The latter case is why I asked about the possibility of passing the
> name of the node the resource is intended to be started on, instead of
> a boolean value (in comments to PR #1026). I would use it to request a
> stop of the Lustre MDTs and OSTs by revoking the ticket they depend on
> if the MGS (the primary Lustre component, which does all "request
> routing") fails to start anywhere in the cluster. That way, if the RA
> does not receive any node name,

Why would ordering constraints be insufficient?

What happens if the MDTs/OSTs continue running because a start of MGS
was expected, but something prevents the start from actually happening?
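
To make the concern concrete, here is a rough sketch (in Python, for
brevity) of a stop handler that reads the hint but never lets it replace
the stop itself. OCF_RESKEY_CRM_start_expected is only the name proposed
in this thread, not an existing variable, and stop_service is a
hypothetical callable standing in for whatever really stops the service.

```python
import os

OCF_SUCCESS = 0      # standard OCF return code: action succeeded
OCF_ERR_GENERIC = 1  # standard OCF return code: generic failure


def start_expected(env=None):
    """Read the proposed hint; anything other than 'true' counts as false.

    The variable name is the one proposed in this thread (an assumption,
    not an existing interface).
    """
    env = os.environ if env is None else env
    value = env.get("OCF_RESKEY_CRM_start_expected", "false")
    return value.strip().lower() == "true"


def stop_action(stop_service, env=None, log=print):
    """Stop handler sketch.

    stop_service is a hypothetical callable that returns True only once
    the service is really down.
    """
    hint = start_expected(env)
    # The hint never short-circuits the stop: the expected start may be
    # cancelled (resource disabled, node fenced), so success is reported
    # only after the service has actually stopped.
    if not stop_service():
        return OCF_ERR_GENERIC
    log("stopped (start expected: %s)" % hint)
    return OCF_SUCCESS
```

In this sketch the hint only changes what is logged. Anything beyond
that has to stay correct when the expected start never arrives.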

> then it can be "almost sure" that pacemaker does not intend to restart
> the resource (yet) and can ask it to stop everything else (because the
> filesystem is not usable anyway). Later, if another start attempt
> (caused by failure-timeout expiration) succeeds, the RA may grant the
> ticket back, and all other resources start again.
> 
> Best,
> Vladislav
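
For reference, the ticket handling described above might look roughly
like the following sketch. It models only the decision, not the calls
that would change the ticket. The function names, the target_node
argument (the node name that would be passed instead of a boolean) and
the returned action strings are all hypothetical.

```python
def on_mgs_stop(target_node, ticket_granted):
    """Decide what a hypothetical MGS agent does with the ticket on stop.

    target_node: name of the node a start is planned on, or None/"" if
    no start is planned (an assumed interface, as proposed in the
    thread).
    """
    if target_node:
        # A start is planned somewhere: leave the MDTs/OSTs running.
        return "keep"
    # No start planned: the filesystem is unusable, so stop everything
    # that depends on the ticket.
    return "revoke" if ticket_granted else "keep"


def on_mgs_start_success(ticket_granted):
    """After a successful start, hand the ticket back if it was revoked."""
    return "keep" if ticket_granted else "grant"
```

Note that on_mgs_stop("node2", True) returns "keep" even if the start
on node2 then never happens, which is exactly the case asked about
above: the ticket stays granted until something else notices.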