[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Tue Jun 7 16:11:19 CEST 2016

Ken Gaillot <kgaillot at redhat.com> wrote:
> On 06/06/2016 05:45 PM, Adam Spiers wrote:
> > Adam Spiers <aspiers at suse.com> wrote:
> >> Andrew Beekhof <abeekhof at redhat.com> wrote:
> >>> On Tue, Jun 7, 2016 at 8:29 AM, Adam Spiers <aspiers at suse.com> wrote:
> >>>> Ken Gaillot <kgaillot at redhat.com> wrote:
> >>>>> My main question is how useful would it actually be in the proposed use
> >>>>> cases. Considering the possibility that the expected start might never
> >>>>> happen (or fail), can an RA really do anything different if
> >>>>> start_expected=true?
> >>>>
> >>>> That's the wrong question :-)
> >>>>
> >>>>> If the use case is there, I have no problem with
> >>>>> adding it, but I want to make sure it's worthwhile.
> >>>>
> >>>> The use case which started this whole thread is for
> >>>> start_expected=false, not start_expected=true.
> >>>
> >>> Isn't this just two sides of the same coin?
> >>> If you're not doing the same thing for both cases, then you're just
> >>> reversing the order of the clauses.
> >>
> >> No, because the stated concern about unreliable expectations
> >> ("Considering the possibility that the expected start might never
> >> happen (or fail)") was regarding start_expected=true, and that's the
> >> side of the coin we don't care about, so it doesn't matter if it's
> >> unreliable.
> > 
> > BTW, if the expected start happens but fails, then Pacemaker will just
> > keep repeating until migration-threshold is hit, at which point it
> > will call the RA 'stop' action finally with start_expected=false.
> > So that's of no concern.
> 
> To clarify, that's configurable, via start-failure-is-fatal and on-fail.

Sure.

> > Maybe your point was that if the expected start never happens (so
> > never even gets a chance to fail), we still want to do a nova
> > service-disable?
> 
> That is a good question, which might mean it should be done on every
> stop -- or could that cause problems (besides delays)?

No, the whole point of adding this feature is to avoid a
service-disable on every stop, and instead only do it on the final
stop.  If there are corner cases where we never reach the final stop,
that's not a disaster because nova will eventually figure it out and
do the right thing when the server-agent connection times out.
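
To make the intended behaviour concrete, here is a minimal sketch of the
stop-side decision. It is only an illustration: start_expected is a proposal
in this thread, so the environment variable name below is hypothetical, and
stop_service / disable_service are placeholder callables standing in for
whatever the RA really does (for example, running nova service-disable).
Running the disable after the stop is also just an assumption.

```python
# Hypothetical variable name: the thread only proposes a start_expected
# hint; how Pacemaker would expose it to the RA is an assumption here.
START_EXPECTED_VAR = "OCF_RESKEY_CRM_meta_start_expected"


def should_disable_on_stop(env):
    """Return True only when the stop is known to be final.

    A missing or unrecognised value is treated as "start expected", so the
    RA falls back to a plain stop without service-disable.
    """
    value = env.get(START_EXPECTED_VAR, "true").strip().lower()
    return value in ("false", "0", "no")


def stop(env, stop_service, disable_service):
    """Stop the service; run service-disable only on the final stop."""
    stop_service()
    if should_disable_on_stop(env):
        disable_service()
```

With this shape, an RA that never sees start_expected=false simply behaves
as it does today, which is the corner case described above where nova has
to notice the timeout by itself.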

> Another aspect of this is that the proposed feature could only look at a
> single transition. What if stop is called with start_expected=false, but
> then Pacemaker is able to start the service on the same node in the next
> transition immediately afterward? Would having called service-disable
> cause problems for that start?

We would also need to ensure that service-enable is called on start
when necessary.  Perhaps we could track the enable/disable state in a
local temporary file, and if the file indicates that we've previously
done service-disable, we know to run service-enable on start.  This
would avoid calling service-enable on every single start.
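
A rough sketch of that marker-file idea follows. The file path and the
function names are hypothetical, and enable_service is a placeholder for
whatever actually runs service-enable; this is one possible shape, not a
worked-out design.

```python
import os
import tempfile

# Hypothetical marker location; a real RA would choose its own path.
MARKER = os.path.join(tempfile.gettempdir(), "nova-compute-ra.disabled")


def record_disabled(marker=MARKER):
    """Remember that service-disable was run on this node."""
    with open(marker, "w") as fh:
        fh.write("disabled\n")


def enable_if_needed(enable_service, marker=MARKER):
    """Run service-enable on start only if an earlier stop disabled it.

    Returns True if service-enable was called, False otherwise.
    """
    if not os.path.exists(marker):
        return False
    enable_service()
    # Remove the marker only after enable_service returned, so that a
    # failed enable (raising an exception) is retried on the next start.
    os.remove(marker)
    return True
```

The stop action would call record_disabled() right after a successful
service-disable, and the start action would call enable_if_needed() before
starting the service.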

> > Yes that would be nice, but this proposal was never intended to
> > address that.  I guess we'd need an entirely different mechanism in
> > Pacemaker for that.  But let's not allow perfection to become the
> > enemy of the good ;-)
> 
> The ultimate concern is that this will encourage people to write RAs
> that leave services in a dangerous state after stop is called.

I don't see why it would.  The new feature will be obscure enough that
no one would be able to use it without reading the corresponding
documentation first anyway.

> I think with naming and documenting it properly, I'm fine to provide the
> option, but I'm on the fence. Beekhof needs a little more convincing :-)

Can you provide an example of a potential real-world situation where
an RA author would end up accidentally abusing the feature?

Thanks a lot for your continued attention to this!

Adam