[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Mon Jun 6 23:21:01 UTC 2016

On Tue, Jun 7, 2016 at 9:07 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
> On 06/06/2016 05:45 PM, Adam Spiers wrote:
>> Adam Spiers <aspiers at suse.com> wrote:
>>> Andrew Beekhof <abeekhof at redhat.com> wrote:
>>>> On Tue, Jun 7, 2016 at 8:29 AM, Adam Spiers <aspiers at suse.com> wrote:
>>>>> Ken Gaillot <kgaillot at redhat.com> wrote:
>>>>>> My main question is how useful would it actually be in the proposed use
>>>>>> cases. Considering the possibility that the expected start might never
>>>>>> happen (or fail), can an RA really do anything different if
>>>>>> start_expected=true?
>>>>>
>>>>> That's the wrong question :-)
>>>>>
>>>>>> If the use case is there, I have no problem with
>>>>>> adding it, but I want to make sure it's worthwhile.
>>>>>
>>>>> The use case which started this whole thread is for
>>>>> start_expected=false, not start_expected=true.
>>>>
>>>> Isn't this just two sides of the same coin?
>>>> If you're not doing the same thing for both cases, then you're just
>>>> reversing the order of the clauses.
>>>
>>> No, because the stated concern about unreliable expectations
>>> ("Considering the possibility that the expected start might never
>>> happen (or fail)") was regarding start_expected=true, and that's the
>>> side of the coin we don't care about, so it doesn't matter if it's
>>> unreliable.
>>
>> BTW, if the expected start happens but fails, then Pacemaker will just
>> keep retrying until migration-threshold is hit, at which point it
>> will finally call the RA 'stop' action with start_expected=false.
>> So that's of no concern.
>
> To clarify, that's configurable, via start-failure-is-fatal and on-fail.
>
>> Maybe your point was that if the expected start never happens (so
>> never even gets a chance to fail), we still want to do a nova
>> service-disable?
>
> That is a good question, which might mean it should be done on every
> stop -- or could that cause problems (besides delays)?
>
> Another aspect of this is that the proposed feature could only look at a
> single transition. What if stop is called with start_expected=false, but
> then Pacemaker is able to start the service on the same node in the next
> transition immediately afterward? Would having called service-disable
> cause problems for that start?
>
>> Yes that would be nice, but this proposal was never intended to
>> address that.  I guess we'd need an entirely different mechanism in
>> Pacemaker for that.  But let's not allow perfection to become the
>> enemy of the good ;-)
>
> The ultimate concern is that this will encourage people to write RAs
> that leave services in a dangerous state after stop is called.
>
> I think that, with proper naming and documentation, I'm fine with providing
> the option, but I'm on the fence. Beekhof needs a little more convincing :-)

I think the new name is a big step in the right direction.
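
For concreteness, the stop-time behaviour under discussion could be sketched
roughly as follows. This is only an illustration in Python, not an actual RA:
the environment variable name OCF_RESKEY_CRM_meta_start_expected and the step
names are hypothetical, since the thread only proposes a start_expected hint
and does not settle how it would reach the RA. The sketch treats a missing
hint as start_expected=false, so the full cleanup still happens on a Pacemaker
that does not provide the hint.

```python
# Hypothetical name: the thread does not define how the hint is exposed.
HINT_VAR = "OCF_RESKEY_CRM_meta_start_expected"


def stop_plan(env):
    """Return the ordered steps a 'stop' action would take.

    `env` is a mapping of environment variables as seen by the RA.
    """
    steps = ["stop-service"]
    # Assumption: a missing hint is treated like "false", the conservative
    # choice, so the service is never left enabled by accident.
    hint = env.get(HINT_VAR, "false").strip().lower()
    if hint != "true":
        # No start is planned in this transition, so also take the
        # compute host out of scheduling.
        steps.append("nova-service-disable")
    return steps


def start_plan(env):
    """Return the ordered steps a 'start' action would take.

    Start always re-enables first, so that an earlier stop with
    start_expected=false cannot block a start in a later transition.
    """
    return ["nova-service-enable", "start-service"]


if __name__ == "__main__":
    print(stop_plan({HINT_VAR: "true"}))
    print(stop_plan({HINT_VAR: "false"}))
    print(start_plan({}))
```

Having start always re-enable is one way to cope with Ken's case where a start
follows on the same node in the very next transition; whether that is
acceptable for nova is exactly the open question.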