[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Thu Jun 23 17:01:02 CEST 2016

Andrew Beekhof <abeekhof at redhat.com> wrote:
> On Wed, Jun 15, 2016 at 10:42 PM, Adam Spiers <aspiers at suse.com> wrote:
> > Andrew Beekhof <abeekhof at redhat.com> wrote:
> >> On Mon, Jun 13, 2016 at 9:34 PM, Adam Spiers <aspiers at suse.com> wrote:
> >> > Andrew Beekhof <abeekhof at redhat.com> wrote:
> >> >> On Wed, Jun 8, 2016 at 6:23 PM, Adam Spiers <aspiers at suse.com> wrote:
> >> >> > Andrew Beekhof <abeekhof at redhat.com> wrote:
> >> >> >> On Wed, Jun 8, 2016 at 12:11 AM, Adam Spiers <aspiers at suse.com> wrote:
> >> >> >> > We would also need to ensure that service-enable is called on start
> >> >> >> > when necessary.  Perhaps we could track the enable/disable state in a
> >> >> >> > local temporary file, and if the file indicates that we've previously
> >> >> >> > done service-disable, we know to run service-enable on start.  This
> >> >> >> > would avoid calling service-enable on every single start.
> >> >> >>
> >> >> >> feels like an over-optimization
> >> >> >> in fact, the whole thing feels like that if i'm honest.
> >> >> >
> >> >> > Huh ... You didn't seem to think that when we discussed automating
> >> >> > service-disable at length in Austin.
> >> >>
> >> >> I didn't feel the need to push back because RH uses the systemd agent
> >> >> instead so you're only hanging yourself, but more importantly because
> >> >> the proposed implementation to facilitate it wasn't leading RA writers
> >> >> down a hazardous path :-)
> >> >
> >> > I'm a bit confused by that statement, because the only proposed
> >> > implementation we came up with in Austin was adding this new feature
> >> > to Pacemaker.
> >>
> >> _A_ new feature, not _this_ new feature.
> >> The one we discussed was far less prone to being abused but, as it
> >> turns out, also far less useful for what you were trying to do.
> >
> > Was there really that much significant change since the original idea?
> > IIRC the only thing which really changed was the type, from "number of
> > retries remaining" to a boolean "there are still some retries" left.
> 
> The new implementation has nothing to do with retries. Like the new
> name, it is based on "is a start action expected".

Oh yeah, I remember now.

> Thats why I got an attack of the heebie-jeebies.

I'm not sure why, but at least now I understand your change of
position :-)

> > I'm not sure why the integer approach would be far less open to abuse,
> > or even why it would have been far less useful.  I'm probably missing
> > something.
> >
> > [snipped]
> >
> >> >> >> why are we trying to optimise the projected performance impact
> >> >> >
> >> >> > It's not really "projected"; we know exactly what the impact is.  And
> >> >> > it's not really a performance impact either.  If nova-compute (or a
> >> >> > dependency) is malfunctioning on a compute node, there will be a
> >> >> > window (bounded by nova.conf's rpc_response_timeout value, IIUC) in
> >> >> > which nova-scheduler could still schedule VMs onto that compute node,
> >> >> > and then of course they'll fail to boot.
> >> >>
> >> >> Right, but that window exists regardless of whether the node is or is
> >> >> not ever coming back.
> >> >
> >> > Sure, but the window's a *lot* bigger if we don't do service-disable.
> >> > Although perhaps your question "why are we trying to optimise the
> >> > projected performance impact" was actually "why are we trying to avoid
> >> > extra calls to service-disable" rather than "why do we want to call
> >> > service-disable" as I initially assumed.  Is that right?
> >>
> >> Exactly.  I assumed it was to limit the noise we'd be generating in doing so.
> >
> > Sort of - not just the noise, but the extra delay introduced by
> > calling service-disable, restarting nova-compute, and then calling
> > service-enable again when it succeeds.
> 
> Ok, but restarting nova-compute is not optional and the bits that are
> optional are all but completely asynchronous* - so the overhead should
> be negligible.
> 
> * Like most API calls, they are Ack'd when the request has been
> received, not processed.

Yes, fair points.

> >> >> > The masakari folks have a lot of operational experience in this space,
> >> >> > and they found that this was enough of a problem to justify calling
> >> >> > nova service-disable whenever the failure is detected.
> >> >>
> >> >> If you really want it whenever the failure is detected, call it from
> >> >> the monitor operation that finds it broken.
> >> >
> >> > Hmm, that appears to violate what I assume would be a fundamental
> >> > design principle of Pacemaker: that the "monitor" action never changes
> >> > the system's state (assuming there are no Heisenberg-like side effects
> >> > of monitoring, of course).
> >>
> >> That has traditionally been the considered a good idea, in the vast
> >> majority of cases I still think it is a good idea, but its also a
> >> guideline that has been broken because there is no other way for the
> >> agent to work *cough* rabbit *cough*.
> >>
> >> In this specific case, I think it could be forgivable because you're
> >> not strictly altering the service but something that sits in front of
> >> it.  start/stop/monitor would all continue to do TheRightThing(tm).
> >>
> >> > I guess you could argue that in this case,
> >> > the nova server's internal state could be considered outside the
> >> > system which Pacemaker is managing.
> >>
> >> Right.
> >
> > Well, if you're OK with bending the rules like this then that's good
> > enough for me to say we should at least try it :)
> 
> I still say you shouldn't only do it on error.

When else should it be done?

IIUC, disabling/enabling the service is independent of the up/down
state which nova tracks automatically, and which based on slightly
more than a skim of the code, is dependent on the state of the RPC
layer.

> > But how would you avoid repeated consecutive invocations of "nova
> > service-disable" when the monitor action fails, and ditto for "nova
> > service-enable" when it succeeds?
> 
> I don't think you can. Not ideal but I'd not have thought a deal breaker.

Sounds like a massive deal-breaker to me!  With op monitor
interval="10s" and 100 compute nodes, that would mean 10 pointless
calls to nova-api every second.  Am I missing something?

Also I don't see any benefit to moving the API calls from start/stop
actions to the monitor action.  If there's a failure, Pacemaker will
invoke the stop action, so we can do service-disable there.  If the
start action is invoked and we successfully initiate startup of
nova-compute, the RA can undo any service-disable it previously did
(although it should not reverse a service-disable done elsewhere,
e.g. manually by the cloud operator).

> > Earlier in this thread I proposed
> > the idea of a tiny temporary file in /run which tracks the last known
> > state and optimizes away the consecutive invocations, but IIRC you
> > were against that.
> 
> I'm generally not a fan, but sometimes state files are a necessity.
> Just make sure you think through what a missing file might mean.

Sure.  A missing file would mean the RA's never called service-disable
before, which means that it shouldn't call service-enable on startup.

> Unless.... use the state file to store the date at which the last
> start operation occurred?
> 
> If we're calling stop() and data - start_date > threshold, then, if
> you must, be optimistic, skip service-disable and assume we'll get
> started again soon.
> 
> Otherwise if we're calling stop() and data - start_date <= threshold,
> always call service-disable because we're in a restart loop which is
> not worth optimising for.
> 
> ( And always call service-enable at start() )
> 
> No Pacemaker feature or Beekhof approval required :-)

Hmm ...  it's possible I just don't understand this proposal fully,
but it sounds a bit woolly to me, e.g. how would you decide a suitable
threshold?

I think I preferred your other suggestion of just skipping the
optimization, i.e. calling service-disable on the first stop, and
service-enable on (almost) every start.

[snipped]

> > OK, so by "unnecessary recovery optimization of an unrecoverable
> > system" you meant "unnecessary recovery optimization of a system which
> > we already tried and failed to recover",
> 
> right
> 
> > whereas I thought you meant
> > "unnecessary recovery optimization of a failed system which we haven't
> > tried to recover yet".   But the optimization we are talking about
> > (i.e. skipping "nova service-disable") would only happen during
> > recovery attempts, before we finally give up, so it only applies to
> > the latter meaning, not the former.
> 
> Strictly speaking, I would disagree.
> 
> The first time we call stop() is the only point at which the "we
> haven't tried to recover yet" claus is true.
> 
> Yet even then we do not and can not know if start() will ever be
> called, nor if it would be here or on another node, nor if any future
> invocation will succeed.  So there is no real justification for the
> existence of an optimistic code path in stop() - which is why I keep
> banging on about correctness trumping optimisation.

That makes sense.

> [This is the point I came up with the (date - start_date) stuff above]
> 
> If you're hell bent on some kind of first-time optimisation though,

No, I'm not.  As I said, I'm happy to try this.  So, in untested
pseudo code:

    stop () {
        # "Best effort" attempt to tell nova the service is
        # unavailable; we cannot rely on this to succeed (since if the
        # stop is due to a failure, that failure might also be
        # creating issues with nova-api), so it has to be
        # asynchronous.
        #
        # N.B. At this point the service might have already been
        # disabled externally, e.g. manually by the cloud operator for
        # planned maintenance.
        attempt_nova_service_disable

        ensure_nova_compute_stopped
    }

    start () {
        if ! start_nova_compute; then
            # Start-up failed, so we don't want to reenable.
            return
        fi

        # In contrast to attempt_nova_service_disable, this has to be
        # synchronous, since if we can't enable the service, there's
        # no point running it and pretending everything's OK.
        #
        # FIXME: we don't want to do this if something else other than
        # the stop() action previously called service-disable.
        ensure_nova_service_enabled
    }

    monitor () {
        # Here some application-level probe is better than a PID check.
    }

As per the FIXME, one remaining problem is dealing with this kind of
scenario:

  - Cloud operator notices SMART warnings on the compute node
    which is not yet causing hard failures but signifies that the
    hard disk might die soon.

  - Operator manually runs "nova service-disable" with the intention
    of doing some maintenance soon, i.e. live-migrating instances away
    and replacing the dying hard disk.

  - Before the operator gracefully shuts down nova-compute, an I/O
    error from the disk causes nova-compute to fail.

  - Pacemaker invokes the monitor action which spots the failure.

  - Pacemaker invokes the stop action which runs service-disable.

  - Pacemaker attempts to restart nova-compute by invoking the start
    action.  Since the disk failure is currently intermittent, we
    get (un)lucky and nova-compute starts fine.

    Then it calls service-enable - BAD!  This is now overriding the
    cloud operator's manual request for the service to be disabled.
    If we're really unlucky, nova-scheduler will now start up new VMs
    on the node, even though the hard disk is dying.

However I can't see a way to defend against this :-/

Initially I thought that keeping track of when service-disable is
called by the RA would help distinguish from when it is called by
the cloud operator (or someone else), e.g.:

    stop () {
        if ! [ -e $service_disabled_state_file ]; then
            # "Best effort" attempt to tell nova the service is
            # unavailable; we cannot rely on this to succeed (since
            # if the stop is due to a failure, that failure might also
            # be creating issues with nova-api), so it has to be
            # asynchronous.
            #
            # N.B. At this point the service might have already
            # been disabled externally, e.g. manually by the cloud
            # operator for planned maintenance.
            attempt_nova_service_disable

            # Optimistically assume the service-disable succeeded,
            # so that even if something went wrong, we make sure
            # to do a service-enable when we restart it.
            touch $service_disabled_state_file
        fi

        ensure_nova_compute_stopped
    }

    start () {
        if ! start_nova_compute; then
            # Start-up failed, so we don't want to reenable.
            return
        fi

        if [ -e $service_disabled_state_file ]; then
            # In contrast to attempt_nova_service_disable, this
            # has to be synchronous, since if we can't enable the
            # service, there's no point running it and pretending
            # everything's OK.
            #
            # FIXME: we don't want to do this if something else
            # other than the stop() action previously called
            # service-disable.
            ensure_nova_service_enabled

            rm $service_disabled_state_file
        fi
    }

    monitor () {
        # Here some application-level probe is better than a PID check.
    }

But with further thought, I'm not sure that state file brings any
benefit, because it would only optimize away nova-api calls if either
start was called several times consecutively, or stop was, and that
won't happen.

Another problem is that if a stop was rapidly followed by a start
(which we can expect with say, migration-threshold=1), the
asynchronous attempt_nova_service_disable might not complete until
after the synchronous ensure_nova_service_enabled completed, in which
case we'd have a running nova-compute which was accidentally disabled.

Thoughts?