[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Mon Jun 6 00:27:12 UTC 2016

On Sat, Jun 4, 2016 at 12:16 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
> On 06/02/2016 08:01 PM, Andrew Beekhof wrote:
>> On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>>> A recent thread discussed a proposed new feature, a new environment
>>> variable that would be passed to resource agents, indicating whether a
>>> stop action was part of a recovery.
>>>
>>> Since that thread was long and covered a lot of topics, I'm starting a
>>> new one to focus on the core issue remaining:
>>>
>>> The original idea was to pass the number of restarts remaining before
>>> the cluster will no longer try to start the resource on the same node. This
>>> involves calculating (migration-threshold - fail-count), and that
>>> implies certain limitations: (1) it will only be set when the cluster
>>> checks migration-threshold; (2) it will only be set for the failed
>>> resource itself, not for other resources that may be recovered due to
>>> dependencies on it.
>>>
>>> Ulrich Windl proposed an alternative: setting a boolean value instead. I
>>> forgot to cc the list on my reply, so I'll summarize now: We would set a
>>> new variable like OCF_RESKEY_CRM_recovery=true
>>
>> This concept worries me, especially when what we've implemented is
>> called OCF_RESKEY_CRM_restarting.
>
> Agreed; I plan to rename it yet again, to OCF_RESKEY_CRM_start_expected.
>
>> The name alone encourages people to "optimise" the agent to not
>> actually stop the service "because it's just going to start again
>> shortly".  I know that's not what Adam would do, but not everyone
>> understands how clusters work.
>>
>> There are any number of reasons why a cluster that intends to restart
>> a service may not do so.  In such a scenario, a badly written agent
>> would cause the cluster to mistakenly believe that the service is
>> stopped - allowing it to start elsewhere.
>>
>> It's true there are any number of ways to write bad agents, but I would
>> argue that we shouldn't be nudging people in that direction :)
>
> I do have mixed feelings about that. I think if we name it
> start_expected, and document it carefully, we can avoid any casual mistakes.
>
> My main question is how useful it would actually be in the proposed use
> cases. Considering the possibility that the expected start might never
> happen (or might fail), can an RA really do anything different if
> start_expected=true?

I would have thought not.  Correctness should trump optimality.
But I'm prepared to be mistaken.

> If the use case is there, I have no problem with
> adding it, but I want to make sure it's worthwhile.
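
To make the concern concrete, here is a rough sketch (in Python, purely for
illustration) of how an agent's stop action could treat such a variable as a
hint only. Everything in it is an assumption: the variable name
OCF_RESKEY_CRM_start_expected is just the name proposed above and may never
exist in a release, and stop_service, is_running and on_final_stop are
hypothetical callbacks standing in for whatever the agent really does. The
point is that the service is always fully stopped and verified; the hint only
decides whether an optional side effect runs.

```python
import os

# Standard OCF exit codes used by resource agents.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1

# Name proposed in this thread; an assumption, not a released interface.
HINT_VAR = "OCF_RESKEY_CRM_start_expected"


def start_expected(env):
    """Return True only if the proposed hint variable is explicitly true."""
    return env.get(HINT_VAR, "false").strip().lower() in ("true", "1", "yes")


def stop(env, stop_service, is_running, on_final_stop=None):
    """Stop action that never skips the real stop, whatever the hint says.

    stop_service, is_running and on_final_stop are hypothetical callbacks
    supplied by the agent.
    """
    # Always stop the service: the expected start may never happen, or fail.
    stop_service()

    # Report failure if the service is still up, so the cluster never
    # wrongly believes it is stopped and starts it elsewhere.
    if is_running():
        return OCF_ERR_GENERIC

    # The hint may only gate optional extras (e.g. a notification),
    # never the stop itself.
    if on_final_stop is not None and not start_expected(env):
        on_final_stop()

    return OCF_SUCCESS


if __name__ == "__main__":
    state = {"up": True}
    rc = stop(
        os.environ,
        stop_service=lambda: state.update(up=False),
        is_running=lambda: state["up"],
        on_final_stop=lambda: print("no start expected; running extra cleanup"),
    )
    raise SystemExit(rc)
```

Written this way, a wrong or stale hint costs at most a skipped or redundant
side effect; it cannot leave the service running while the cluster thinks it
is stopped.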