[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Wed Jun 8 10:23:36 CEST 2016

Andrew Beekhof <abeekhof at redhat.com> wrote:
> On Wed, Jun 8, 2016 at 12:11 AM, Adam Spiers <aspiers at suse.com> wrote:
> > Ken Gaillot <kgaillot at redhat.com> wrote:
> >> On 06/06/2016 05:45 PM, Adam Spiers wrote:
> >> > Adam Spiers <aspiers at suse.com> wrote:
> >> >> Andrew Beekhof <abeekhof at redhat.com> wrote:
> >> >>> On Tue, Jun 7, 2016 at 8:29 AM, Adam Spiers <aspiers at suse.com> wrote:
> >> >>>> Ken Gaillot <kgaillot at redhat.com> wrote:
> >> >>>>> My main question is how useful would it actually be in the proposed use
> >> >>>>> cases. Considering the possibility that the expected start might never
> >> >>>>> happen (or fail), can an RA really do anything different if
> >> >>>>> start_expected=true?
> >> >>>>
> >> >>>> That's the wrong question :-)
> >> >>>>
> >> >>>>> If the use case is there, I have no problem with
> >> >>>>> adding it, but I want to make sure it's worthwhile.
> >> >>>>
> >> >>>> The use case which started this whole thread is for
> >> >>>> start_expected=false, not start_expected=true.
> >> >>>
> >> >>> Isn't this just two sides of the same coin?
> >> >>> If you're not doing the same thing for both cases, then you're just
> >> >>> reversing the order of the clauses.
> >> >>
> >> >> No, because the stated concern about unreliable expectations
> >> >> ("Considering the possibility that the expected start might never
> >> >> happen (or fail)") was regarding start_expected=true, and that's the
> >> >> side of the coin we don't care about, so it doesn't matter if it's
> >> >> unreliable.
> >> >
> >> > BTW, if the expected start happens but fails, then Pacemaker will just
> >> > keep repeating until migration-threshold is hit, at which point it
> >> > will call the RA 'stop' action finally with start_expected=false.
> >> > So that's of no concern.
> >>
> >> To clarify, that's configurable, via start-failure-is-fatal and on-fail
> >
> > Sure.
> >
> >> > Maybe your point was that if the expected start never happens (so
> >> > never even gets a chance to fail), we still want to do a nova
> >> > service-disable?
> >>
> >> That is a good question, which might mean it should be done on every
> >> stop -- or could that cause problems (besides delays)?
> >
> > No, the whole point of adding this feature is to avoid a
> > service-disable on every stop, and instead only do it on the final
> > stop.  If there are corner cases where we never reach the final stop,
> > that's not a disaster because nova will eventually figure it out and
> > do the right thing when the server-agent connection times out.
> >
> >> Another aspect of this is that the proposed feature could only look at a
> >> single transition. What if stop is called with start_expected=false, but
> >> then Pacemaker is able to start the service on the same node in the next
> >> transition immediately afterward? Would having called service-disable
> >> cause problems for that start?
> >
> > We would also need to ensure that service-enable is called on start
> > when necessary.  Perhaps we could track the enable/disable state in a
> > local temporary file, and if the file indicates that we've previously
> > done service-disable, we know to run service-enable on start.  This
> > would avoid calling service-enable on every single start.
> 
> feels like an over-optimization
> in fact, the whole thing feels like that if i'm honest.

Huh ... You didn't seem to think that when we discussed automating
service-disable at length in Austin.  What changed?  Can you suggest a
better approach?

> why are we trying to optimise the projected performance impact

It's not really "projected"; we know exactly what the impact is.  And
it's not really a performance impact either.  If nova-compute (or a
dependency) is malfunctioning on a compute node, there will be a
window (bounded by nova.conf's rpc_response_timeout value, IIUC) in
which nova-scheduler could still schedule VMs onto that compute node,
and then of course they'll fail to boot.

The masakari folks have a lot of operational experience in this space,
and they found that this was enough of a problem to justify calling
nova service-disable whenever the failure is detected.  So this
feature has been in masakari for a long time already.  We are simply
aiming for feature parity in a converged upstream implementation.

> when the system is in terrible shape already?

I'm not sure what you mean by that, but it's impossible to do anything
about it unless you at least give some details, if not propose a
solution ;-)  But openstack-dev@ would be a better list for that.

> >> > Yes that would be nice, but this proposal was never intended to
> >> > address that.  I guess we'd need an entirely different mechanism in
> >> > Pacemaker for that.  But let's not allow perfection to become the
> >> > enemy of the good ;-)
> >>
> >> The ultimate concern is that this will encourage people to write RAs
> >> that leave services in a dangerous state after stop is called.
> >
> > I don't see why it would.
> 
> Previous experience suggests it definitely will.
>
> People will do exactly what you're thinking but with something important.
> They'll see it behaves as they expect in best-case testing and never
> think about the corner cases.
> Then they'll start thinking about optimising their start operations,
> write some "optimistic" state recording code and break those too.
> 
> Imagine a bug in your state recording code (maybe you forget to handle
> a missing state file after reboot) that means the 'enable' does't get
> run.  The service is up, but nova will never use it.
> 
> > The new feature will be obscure enough that
> > noone would be able to use it without reading the corresponding
> > documentation first anyway.
> 
> I like your optimism.
> 
> >> I think with naming and documenting it properly, I'm fine to provide the
> >> option, but I'm on the fence. Beekhof needs a little more convincing :-)
> >
> > Can you provide an example of a potential real-world situation where
> > an RA author would end up accidentally abusing the feature?
> 
> You want a real-world example of how someone could accidentally
> mis-using a feature that doesn't exist yet?
> 
> Um... if we knew all the weird and wonderful ways people break our
> code we'd be able to build a better mouse trap.

So what are you suggesting?  That we should deliberately avoid making
any progress, based on nebulous fear of other people making stupid
mistakes in ways that we can't even think of?  I'm totally open to
other ideas, but I'm not hearing any yet.