[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Mon Jun 13 11:34:00 UTC 2016

Andrew Beekhof <abeekhof at redhat.com> wrote:
> On Wed, Jun 8, 2016 at 6:23 PM, Adam Spiers <aspiers at suse.com> wrote:
> > Andrew Beekhof <abeekhof at redhat.com> wrote:
> >> On Wed, Jun 8, 2016 at 12:11 AM, Adam Spiers <aspiers at suse.com> wrote:
> >> > Ken Gaillot <kgaillot at redhat.com> wrote:
> >> >> On 06/06/2016 05:45 PM, Adam Spiers wrote:
> >> >> > Maybe your point was that if the expected start never happens (so
> >> >> > never even gets a chance to fail), we still want to do a nova
> >> >> > service-disable?
> >> >>
> >> >> That is a good question, which might mean it should be done on every
> >> >> stop -- or could that cause problems (besides delays)?
> >> >
> >> > No, the whole point of adding this feature is to avoid a
> >> > service-disable on every stop, and instead only do it on the final
> >> > stop.  If there are corner cases where we never reach the final stop,
> >> > that's not a disaster because nova will eventually figure it out and
> >> > do the right thing when the server-agent connection times out.
> >> >
> >> >> Another aspect of this is that the proposed feature could only look at a
> >> >> single transition. What if stop is called with start_expected=false, but
> >> >> then Pacemaker is able to start the service on the same node in the next
> >> >> transition immediately afterward? Would having called service-disable
> >> >> cause problems for that start?
> >> >
> >> > We would also need to ensure that service-enable is called on start
> >> > when necessary.  Perhaps we could track the enable/disable state in a
> >> > local temporary file, and if the file indicates that we've previously
> >> > done service-disable, we know to run service-enable on start.  This
> >> > would avoid calling service-enable on every single start.
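
For illustration only, the state-file idea quoted above could be sketched
roughly as follows in Python. Everything here is an assumption made for
the sake of the example: the file location, the function names, and the
`disable`/`enable` callables standing in for the real nova calls. The
`start_expected` flag is the hypothetical hint from Pacemaker discussed
in this thread, not an existing interface.

```python
import os
import tempfile

# Hypothetical marker file recording that service-disable was issued.
STATE_FILE = os.path.join(tempfile.gettempdir(), "nova-compute.disabled")


def on_stop(state_file, start_expected, disable):
    """Disable the service only on a 'final' stop, and remember that we did."""
    if not start_expected:
        disable()
        # Create an empty marker so the next start knows to re-enable.
        with open(state_file, "w"):
            pass


def on_start(state_file, enable):
    """Re-enable the service only if a previous stop disabled it."""
    if os.path.exists(state_file):
        enable()
        os.remove(state_file)
```

A stop with start_expected=true leaves no marker, so the following start
skips service-enable entirely; that is the saving being debated here.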
> >>
> >> feels like an over-optimization
> >> in fact, the whole thing feels like that if i'm honest.
> >
> > Huh ... You didn't seem to think that when we discussed automating
> > service-disable at length in Austin.
> 
> I didn't feel the need to push back because RH uses the systemd agent
> instead so you're only hanging yourself, but more importantly because
> the proposed implementation to facilitate it wasn't leading RA writers
> down a hazardous path :-)

I'm a bit confused by that statement, because the only proposed
implementation we came up with in Austin was adding this new feature
to Pacemaker.  Prior to that, AFAICR, you, Dawid, and I had a long
afternoon discussion in the sun where we tried to figure out a way to
implement it just by tweaking the OCF RAs, but every approach we
discussed turned out to have fundamental issues.  That's why we
eventually turned to the idea of this new feature in Pacemaker.

But anyway, it's water under the bridge now :-)

> > What changed?  Can you suggest a better approach?
> 
> Either always or never disable the service would be my advice.
> "Always" specifically getting my vote.

OK, thanks.  We discussed that at the meeting this morning, and it
looks like we'll give it a try.

> >> why are we trying to optimise the projected performance impact
> >
> > It's not really "projected"; we know exactly what the impact is.  And
> > it's not really a performance impact either.  If nova-compute (or a
> > dependency) is malfunctioning on a compute node, there will be a
> > window (bounded by nova.conf's rpc_response_timeout value, IIUC) in
> > which nova-scheduler could still schedule VMs onto that compute node,
> > and then of course they'll fail to boot.
> 
> Right, but that window exists regardless of whether the node is or is
> not ever coming back.

Sure, but the window's a *lot* bigger if we don't do service-disable.
Although perhaps your question "why are we trying to optimise the
projected performance impact" was actually "why are we trying to avoid
extra calls to service-disable" rather than "why do we want to call
service-disable" as I initially assumed.  Is that right?

> And as we already discussed, the proposed feature still leaves you
> open to this window because we can't know if the expected restart will
> ever happen.

Yes, but as I already said, the perfect should not become the enemy of
the good.  Just because an approach doesn't solve all cases, it
doesn't necessarily mean it's not suitable for solving some of them.

> In this context, trying to avoid the disable call under certain
> circumstances, to avoid repeated and frequent flip-flopping of the
> state, seems ill-advised.  At the point nova compute is bouncing up
> and down like that, you have a more fundamental issue somewhere in
> your stack and this is only one (and IMHO minor) symptom of it.

That's a fair point.

> > The masakari folks have a lot of operational experience in this space,
> > and they found that this was enough of a problem to justify calling
> > nova service-disable whenever the failure is detected.
> 
> If you really want it whenever the failure is detected, call it from
> the monitor operation that finds it broken.

Hmm, that appears to violate what I assume would be a fundamental
design principle of Pacemaker: that the "monitor" action never changes
the system's state (assuming there are no Heisenberg-like side effects
of monitoring, of course).  I guess you could argue that in this case,
the nova server's internal state could be considered outside the
system which Pacemaker is managing.

> I'm arguing that trying to do it only on failure is an over-optimization
> and probably a mistake.

OK.

[snipped]

> >> > The new feature will be obscure enough that
> >> > no one would be able to use it without reading the corresponding
> >> > documentation first anyway.
> >>
> >> I like your optimism.
> 
> As a general rule, people do not read documentation.
> They see an option, decide what it does based on the name and move on
> if some limited testing appears to confirm their theory.

How else would they see the option without reading the documentation?
It seems pretty unlikely that someone outside the OpenStack HA
community would happen to read the NovaCompute RA source code, notice
it, and decide to use it without doing any further research.  And if
they did, and somehow got it wrong, that would still be caught when
they submitted the usage to the resource-agents project.  And if they
didn't submit the RA for review anywhere then the chances are high
that they have other serious problems with it ...

> >> >> I think with naming and documenting it properly, I'm fine to provide the
> >> >> option, but I'm on the fence. Beekhof needs a little more convincing :-)
> >> >
> >> > Can you provide an example of a potential real-world situation where
> >> > an RA author would end up accidentally abusing the feature?
> >>
> >> You want a real-world example of how someone could accidentally
> >> misuse a feature that doesn't exist yet?
> >>
> >> Um... if we knew all the weird and wonderful ways people break our
> >> code we'd be able to build a better mouse trap.
> >
> > So what are you suggesting?  That we should deliberately avoid making
> > any progress, based on nebulous fear of other people making stupid
> > mistakes in ways that we can't even think of?
> 
> It's not nebulous; we've seen for many years now how these things get
> used and how hard it is to walk back design errors.  You don't need to
> maintain it or deal with the people 5 or 10 years from now that are
> still getting themselves into trouble as a result, so of course you
> set the bar lower.
> 
> Heck, I've seen how long it took people to stop abusing "target_rc" in
> their monitor functions (hint, it wasn't voluntary -
> https://github.com/beekhof/pacemaker/commit/46a65a8 ).

OK then, I'll defer to your experience in this area, even though it
seems a bit overly paranoid to me.

> >  I'm totally open to
> > other ideas, but I'm not hearing any yet.
> 
> I'm saying that this is not progress.
> That even the use case the feature was designed for shouldn't use it.
> That the service should be unconditionally disabled on stop and
> enabled on start.
> That the desire for conditional enabling/disabling is rooted in the
> unnecessary recovery optimization of an unrecoverable system.
> 
> Clear enough?

That's a lot clearer, thanks.  We discussed this topic at the HA
meeting today and agreed to try this approach.  Hopefully you will be
proven correct and there will be no need for further optimization :-)
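
To make the agreed approach concrete, here is a minimal Python sketch of
"unconditionally disable on stop, enable on start". It is not the actual
NovaCompute RA: the function names, the `start_daemon`/`stop_daemon`
callables, and the injectable `run` helper are all hypothetical, and the
argument order passed to the nova CLI is an assumption. The ordering
(disable before stopping, enable after starting) is likewise just one
plausible choice.

```python
import subprocess

# OCF return code for success.
OCF_SUCCESS = 0


def nova_service_cmd(action, host, binary="nova-compute"):
    """Build the nova CLI argv (argument order is an assumption)."""
    return ["nova", "service-" + action, host, binary]


def start(host, start_daemon, run=subprocess.check_call):
    """Start the daemon, then always tell nova the service is enabled."""
    start_daemon()
    run(nova_service_cmd("enable", host))
    return OCF_SUCCESS


def stop(host, stop_daemon, run=subprocess.check_call):
    """Always tell nova the service is disabled, then stop the daemon."""
    run(nova_service_cmd("disable", host))
    stop_daemon()
    return OCF_SUCCESS
```

With this shape there is no state to track between transitions, and no
dependence on Pacemaker knowing whether a restart is expected.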

BTW I disagree with the characterization of *all* these scenarios as
"unrecoverable" (even though it's correct for some of them), but
that's a minor nitpick which probably isn't worth discussing.