[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Mon Jun 6 13:25:41 EDT 2016

06.06.2016 19:39, Ken Gaillot wrote:
> On 06/05/2016 07:27 PM, Andrew Beekhof wrote:
>> On Sat, Jun 4, 2016 at 12:16 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>>> On 06/02/2016 08:01 PM, Andrew Beekhof wrote:
>>>> On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>>>>> A recent thread discussed a proposed new feature, a new environment
>>>>> variable that would be passed to resource agents, indicating whether a
>>>>> stop action was part of a recovery.
>>>>>
>>>>> Since that thread was long and covered a lot of topics, I'm starting a
>>>>> new one to focus on the core issue remaining:
>>>>>
>>>>> The original idea was to pass the number of restarts remaining before
>>>>> the cluster will no longer try to start the resource on the same node.
>>>>> This involves calculating (migration-threshold - fail-count), and that
>>>>> implies certain limitations: (1) it will only be set when the cluster
>>>>> checks migration-threshold; (2) it will only be set for the failed
>>>>> resource itself, not for other resources that may be recovered due to
>>>>> dependencies on it.
>>>>>
>>>>> Ulrich Windl proposed an alternative: setting a boolean value instead. I
>>>>> forgot to cc the list on my reply, so I'll summarize now: We would set a
>>>>> new variable like OCF_RESKEY_CRM_recovery=true
>>>>
>>>> This concept worries me, especially when what we've implemented is
>>>> called OCF_RESKEY_CRM_restarting.
>>>
>>> Agreed; I plan to rename it yet again, to OCF_RESKEY_CRM_start_expected.
>>>
>>>> The name alone encourages people to "optimise" the agent to not
>>>> actually stop the service "because it's just going to start again
>>>> shortly".  I know that's not what Adam would do, but not everyone
>>>> understands how clusters work.
>>>>
>>>> There are any number of reasons why a cluster that intends to restart
>>>> a service may not do so.  In such a scenario, a badly written agent
>>>> would cause the cluster to mistakenly believe that the service is
>>>> stopped - allowing it to start elsewhere.
>>>>
>>>> It's true there are any number of ways to write bad agents, but I would
>>>> argue that we shouldn't be nudging people in that direction :)
>>>
>>> I do have mixed feelings about that. I think if we name it
>>> start_expected, and document it carefully, we can avoid any casual mistakes.
>>>
>>> My main question is how useful would it actually be in the proposed use
>>> cases. Considering the possibility that the expected start might never
>>> happen (or fail), can an RA really do anything different if
>>> start_expected=true?
>>
>> I would have thought not.  Correctness should trump optimal.
>> But I'm prepared to be mistaken.
>>
>>> If the use case is there, I have no problem with
>>> adding it, but I want to make sure it's worthwhile.
>
> Anyone have comments on this?
>
> A simple example: pacemaker calls an RA stop with start_expected=true,
> then before the start happens, someone disables the resource, so the
> start is never called. Or the node is fenced before the start happens, etc.
>
> Is there anything significant an RA can do differently based on
> start_expected=true/false without causing problems if an expected start
> never happens?

Yep.

It may request a stop of other resources
* on that node, by removing node attributes that participate in
location constraints
* or cluster-wide, by revoking, or putting into standby, a cluster ticket
that other resources depend on

The latter case is why I asked (in comments to PR #1026) about the
possibility of passing the name of the node on which the resource is
intended to be started, instead of a boolean value. I would use it to
request a stop of the Lustre MDTs and OSTs, by revoking the ticket they
depend on, if the MGS (the primary Lustre component, which does all
"request routing") fails to start anywhere in the cluster. That way, if
the RA does not receive a node name, it can be "almost sure" that
Pacemaker does not intend to restart the resource (yet), and it can
request that everything else be stopped (because the filesystem is not
usable anyway). Later, if another start attempt (caused by
failure-timeout expiration) succeeds, the RA may grant the ticket back,
and all other resources start again.
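
A rough sketch of that decision, in Python only for illustration (a real
RA would more likely be a shell script). The variable name
OCF_RESKEY_CRM_start_expected_node is hypothetical: passing a node name
is only a proposal here, and nothing like it is confirmed to exist in
Pacemaker. The function only decides what to do with the ticket; the
actual revoke/grant call is left out, and the stop action itself must
still fully stop the service in every case.

```python
import os

# Hypothetical variable name: passing the intended start node to the RA is
# only a proposal in this thread, not an existing Pacemaker feature.
START_NODE_VAR = "OCF_RESKEY_CRM_start_expected_node"


def ticket_action(action, succeeded, env=None):
    """Return "revoke", "grant", or "none" for the ticket that the
    dependent resources (e.g. Lustre MDTs/OSTs) rely on.

    action    -- the RA action that just ran: "start" or "stop"
    succeeded -- whether that action completed successfully
    env       -- environment mapping; defaults to os.environ
    """
    env = os.environ if env is None else env
    expected_node = env.get(START_NODE_VAR, "").strip()

    if action == "stop" and not expected_node:
        # No node name was passed, so no restart appears to be planned
        # anywhere: ask for the dependent resources to be stopped too.
        return "revoke"
    if action == "start" and succeeded:
        # A later start attempt worked: let the dependents run again.
        return "grant"
    # Otherwise leave the ticket alone (restart expected, or start failed).
    return "none"
```

For example, ticket_action("stop", True, {}) gives "revoke", while the
same call with the variable set to a node name gives "none".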

Best,
Vladislav
