[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Wed Jun 15 20:39:43 EDT 2016

On Wed, Jun 15, 2016 at 10:42 PM, Adam Spiers <aspiers at suse.com> wrote:
> Andrew Beekhof <abeekhof at redhat.com> wrote:
>> On Mon, Jun 13, 2016 at 9:34 PM, Adam Spiers <aspiers at suse.com> wrote:
>> > Andrew Beekhof <abeekhof at redhat.com> wrote:
>> >> On Wed, Jun 8, 2016 at 6:23 PM, Adam Spiers <aspiers at suse.com> wrote:
>> >> > Andrew Beekhof <abeekhof at redhat.com> wrote:
>> >> >> On Wed, Jun 8, 2016 at 12:11 AM, Adam Spiers <aspiers at suse.com> wrote:
>> >> >> > We would also need to ensure that service-enable is called on start
>> >> >> > when necessary.  Perhaps we could track the enable/disable state in a
>> >> >> > local temporary file, and if the file indicates that we've previously
>> >> >> > done service-disable, we know to run service-enable on start.  This
>> >> >> > would avoid calling service-enable on every single start.
>> >> >>
>> >> >> feels like an over-optimization
>> >> >> in fact, the whole thing feels like that if i'm honest.
>> >> >
>> >> > Huh ... You didn't seem to think that when we discussed automating
>> >> > service-disable at length in Austin.
>> >>
>> >> I didn't feel the need to push back because RH uses the systemd agent
>> >> instead so you're only hanging yourself, but more importantly because
>> >> the proposed implementation to facilitate it wasn't leading RA writers
>> >> down a hazardous path :-)
>> >
>> > I'm a bit confused by that statement, because the only proposed
>> > implementation we came up with in Austin was adding this new feature
>> > to Pacemaker.
>>
>> _A_ new feature, not _this_ new feature.
>> The one we discussed was far less prone to being abused but, as it
>> turns out, also far less useful for what you were trying to do.
>
> Was there really that much significant change since the original idea?
> IIRC the only thing which really changed was the type, from "number of
> retries remaining" to a boolean "there are still some retries" left.

The new implementation has nothing to do with retries. Like the new
name, it is based on "is a start action expected".
Thats why I got an attack of the heebie-jeebies.

> I'm not sure why the integer approach would be far less open to abuse,
> or even why it would have been far less useful.  I'm probably missing
> something.
>
> [snipped]
>
>> >> >> why are we trying to optimise the projected performance impact
>> >> >
>> >> > It's not really "projected"; we know exactly what the impact is.  And
>> >> > it's not really a performance impact either.  If nova-compute (or a
>> >> > dependency) is malfunctioning on a compute node, there will be a
>> >> > window (bounded by nova.conf's rpc_response_timeout value, IIUC) in
>> >> > which nova-scheduler could still schedule VMs onto that compute node,
>> >> > and then of course they'll fail to boot.
>> >>
>> >> Right, but that window exists regardless of whether the node is or is
>> >> not ever coming back.
>> >
>> > Sure, but the window's a *lot* bigger if we don't do service-disable.
>> > Although perhaps your question "why are we trying to optimise the
>> > projected performance impact" was actually "why are we trying to avoid
>> > extra calls to service-disable" rather than "why do we want to call
>> > service-disable" as I initially assumed.  Is that right?
>>
>> Exactly.  I assumed it was to limit the noise we'd be generating in doing so.
>
> Sort of - not just the noise, but the extra delay introduced by
> calling service-disable, restarting nova-compute, and then calling
> service-enable again when it succeeds.

Ok, but restarting nova-compute is not optional and the bits that are
optional are all but completely asynchronous* - so the overhead should
be negligible.

* Like most API calls, they are Ack'd when the request has been
received, not processed.

>
> [snipped]
>
>> >> > The masakari folks have a lot of operational experience in this space,
>> >> > and they found that this was enough of a problem to justify calling
>> >> > nova service-disable whenever the failure is detected.
>> >>
>> >> If you really want it whenever the failure is detected, call it from
>> >> the monitor operation that finds it broken.
>> >
>> > Hmm, that appears to violate what I assume would be a fundamental
>> > design principle of Pacemaker: that the "monitor" action never changes
>> > the system's state (assuming there are no Heisenberg-like side effects
>> > of monitoring, of course).
>>
>> That has traditionally been the considered a good idea, in the vast
>> majority of cases I still think it is a good idea, but its also a
>> guideline that has been broken because there is no other way for the
>> agent to work *cough* rabbit *cough*.
>>
>> In this specific case, I think it could be forgivable because you're
>> not strictly altering the service but something that sits in front of
>> it.  start/stop/monitor would all continue to do TheRightThing(tm).
>>
>> > I guess you could argue that in this case,
>> > the nova server's internal state could be considered outside the
>> > system which Pacemaker is managing.
>>
>> Right.
>
> Well, if you're OK with bending the rules like this then that's good
> enough for me to say we should at least try it :)

I still say you shouldn't only do it on error.

>
> But how would you avoid repeated consecutive invocations of "nova
> service-disable" when the monitor action fails, and ditto for "nova
> service-enable" when it succeeds?

I don't think you can. Not ideal but I'd not have thought a deal breaker.

> Earlier in this thread I proposed
> the idea of a tiny temporary file in /run which tracks the last known
> state and optimizes away the consecutive invocations, but IIRC you
> were against that.

I'm generally not a fan, but sometimes state files are a necessity.
Just make sure you think through what a missing file might mean.

Unless.... use the state file to store the date at which the last
start operation occurred?

If we're calling stop() and data - start_date > threshold, then, if
you must, be optimistic, skip service-disable and assume we'll get
started again soon.

Otherwise if we're calling stop() and data - start_date <= threshold,
always call service-disable because we're in a restart loop which is
not worth optimising for.

( And always call service-enable at start() )

No Pacemaker feature or Beekhof approval required :-)

>
>> >> I'm arguing that trying to do it only failure is an over optimization
>> >> and probably a mistake.
>> >
>> > OK.
>> >
>> > [snipped]
>> >
>> >> >> > The new feature will be obscure enough that
>> >> >> > noone would be able to use it without reading the corresponding
>> >> >> > documentation first anyway.
>> >> >>
>> >> >> I like your optimism.
>> >>
>> >> As a general rule, people do not read documentation.
>> >> They see an option, decide what it does based on the name and move on
>> >> if some limited testing appears to confirm their theory.
>> >
>> > How would else they see the option without reading the documentation?
>>
>> Same way I've discovered things in the past in similar environments:
>> env | sort -u
>
> They could also read the source.  Possible?  Sure.  Likely to happen
> often?  Not so much.

Strongly disagree.
We removed target_rc from the documentation yet people still found and
abused it for years.

>
>> > It seems pretty unlikely that someone outside the OpenStack HA
>> > community would happen to read the NovaCompute RA source code, notice
>> > it, and decide to use it without doing any further research.
>>
>> Ah, but people are already advocating that it should default to "=no"
>> everywhere else.
>
> Sorry, I don't follow what you mean by this.

A value will be set in every call to every agent, making it visible to
anyone poking around with 'env'.
Which people absolutely do.

>
>> These things always snowball.
>>
>> >  And if
>> > they did, and somehow got it wrong, that would still be caught when
>> > they submitted the usage to the resource-agents project.
>>
>> Not all agents come back to us.
>
> Which is why I said:
>
>> >  And if they
>> > didn't submit the RA for review anywhere then the chances are high
>> > that they have other serious problems with it ...

Just because they don't submit it for inclusion doesn't mean they wont
come looking for support though.

>
> [snipped]
>
>> >> > So what are you suggesting?  That we should deliberately avoid making
>> >> > any progress, based on nebulous fear of other people making stupid
>> >> > mistakes in ways that we can't even think of?
>> >>
>> >> Its not nebulous, we've seen for many years now how these things get
>> >> used and how hard it is to walk back design errors.  You don't need to
>> >> maintain it or deal with the people 5 or 10 years from now that are
>> >> still getting themselves into trouble as a result, so of course you
>> >> set the bar lower.
>> >>
>> >> Heck, I've seen how long it took people to stop abusing "target_rc" in
>> >> their monitor functions (hint, it wasn't voluntary -
>> >> https://github.com/beekhof/pacemaker/commit/46a65a8 ).
>> >
>> > OK then, I'll defer to your experience in this area, even though it
>> > seems a bit overly paranoid to me.
>>
>> 7 years of "no, you need to return the true state of the service, not
>> what target_rc contains" is overly paranoid?
>
> Return codes are a fundamental part of every RA action, so it's
> unavoidable that every RA author has to be aware of them and use
> them.  That's totally different to this case of some obscure optional
> environment variable.  That's my opinion anyway, but you don't have to
> agree :-)
>
>> >> >  I'm totally open to
>> >> > other ideas, but I'm not hearing any yet.
>> >>
>> >> I'm saying that this is not progress.
>> >> That even the usecase the feature was designed for shouldn't use it.
>> >> That the service should be unconditionally disabled on stop and
>> >> enabled on start.
>> >> That the desire for conditional enabling/disabling is rooted in the
>> >> unnecessary recovery optimization of an unrecoverable system.
>> >>
>> >> Clear enough?
>> >
>> > That's a lot clearer, thanks.  We discussed this topic at the HA
>> > meeting today and agreed to try this approach.  Hopefully you will be
>> > proven correct and there will be no need for further optimization :-)
>> >
>> > BTW I disagree with the characterization of *all* these scenarios as
>> > "unrecoverable" (even though it's correct for some of them), but
>> > that's a minor nitpick which probably isn't worth discussing.
>>
>> I'd class them as unrecoverable because whatever we've already tried
>> (restarting a set of services, including nova compute, several times)
>> clearly isn't achieving anything.
>
> OK, so by "unnecessary recovery optimization of an unrecoverable
> system" you meant "unnecessary recovery optimization of a system which
> we already tried and failed to recover",

right

> whereas I thought you meant
> "unnecessary recovery optimization of a failed system which we haven't
> tried to recover yet".   But the optimization we are talking about
> (i.e. skipping "nova service-disable") would only happen during
> recovery attempts, before we finally give up, so it only applies to
> the latter meaning, not the former.

Strictly speaking, I would disagree.

The first time we call stop() is the only point at which the "we
haven't tried to recover yet" claus is true.

Yet even then we do not and can not know if start() will ever be
called, nor if it would be here or on another node, nor if any future
invocation will succeed.  So there is no real justification for the
existence of an optimistic code path in stop() - which is why I keep
banging on about correctness trumping optimisation.

[This is the point I came up with the (date - start_date) stuff above]

If you're hell bent on some kind of first-time optimisation though,
recording the last start date has the best benefit/complexity tradeoff
I can think of.

>
> IOW, in the case where a restart actually succeeds, we had a
> *recoverable* system.

True, but there is no way the stop() action can know this ahead of time.

>
>> Whatever the problem is, it is outside of the cluster's ability to
>> compensate for.
>
> In the non-recoverable case, I agree.
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org