[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?
Adam Spiers
aspiers at suse.com
Thu Jun 23 15:01:02 UTC 2016
Andrew Beekhof <abeekhof at redhat.com> wrote:
> On Wed, Jun 15, 2016 at 10:42 PM, Adam Spiers <aspiers at suse.com> wrote:
> > Andrew Beekhof <abeekhof at redhat.com> wrote:
> >> On Mon, Jun 13, 2016 at 9:34 PM, Adam Spiers <aspiers at suse.com> wrote:
> >> > Andrew Beekhof <abeekhof at redhat.com> wrote:
> >> >> On Wed, Jun 8, 2016 at 6:23 PM, Adam Spiers <aspiers at suse.com> wrote:
> >> >> > Andrew Beekhof <abeekhof at redhat.com> wrote:
> >> >> >> On Wed, Jun 8, 2016 at 12:11 AM, Adam Spiers <aspiers at suse.com> wrote:
> >> >> >> > We would also need to ensure that service-enable is called on start
> >> >> >> > when necessary. Perhaps we could track the enable/disable state in a
> >> >> >> > local temporary file, and if the file indicates that we've previously
> >> >> >> > done service-disable, we know to run service-enable on start. This
> >> >> >> > would avoid calling service-enable on every single start.
> >> >> >>
> >> >> >> feels like an over-optimization
> >> >> >> in fact, the whole thing feels like that if i'm honest.
> >> >> >
> >> >> > Huh ... You didn't seem to think that when we discussed automating
> >> >> > service-disable at length in Austin.
> >> >>
> >> >> I didn't feel the need to push back because RH uses the systemd agent
> >> >> instead so you're only hanging yourself, but more importantly because
> >> >> the proposed implementation to facilitate it wasn't leading RA writers
> >> >> down a hazardous path :-)
> >> >
> >> > I'm a bit confused by that statement, because the only proposed
> >> > implementation we came up with in Austin was adding this new feature
> >> > to Pacemaker.
> >>
> >> _A_ new feature, not _this_ new feature.
> >> The one we discussed was far less prone to being abused but, as it
> >> turns out, also far less useful for what you were trying to do.
> >
> > Was there really that much significant change since the original idea?
> > IIRC the only thing which really changed was the type, from "number of
> > retries remaining" to a boolean "there are still some retries left".
>
> The new implementation has nothing to do with retries. Like the new
> name, it is based on "is a start action expected".
Oh yeah, I remember now.
> That's why I got an attack of the heebie-jeebies.
I'm not sure why, but at least now I understand your change of
position :-)
> > I'm not sure why the integer approach would be far less open to abuse,
> > or even why it would have been far less useful. I'm probably missing
> > something.
> >
> > [snipped]
> >
> >> >> >> why are we trying to optimise the projected performance impact
> >> >> >
> >> >> > It's not really "projected"; we know exactly what the impact is. And
> >> >> > it's not really a performance impact either. If nova-compute (or a
> >> >> > dependency) is malfunctioning on a compute node, there will be a
> >> >> > window (bounded by nova.conf's rpc_response_timeout value, IIUC) in
> >> >> > which nova-scheduler could still schedule VMs onto that compute node,
> >> >> > and then of course they'll fail to boot.
> >> >>
> >> >> Right, but that window exists regardless of whether the node is or is
> >> >> not ever coming back.
> >> >
> >> > Sure, but the window's a *lot* bigger if we don't do service-disable.
> >> > Although perhaps your question "why are we trying to optimise the
> >> > projected performance impact" was actually "why are we trying to avoid
> >> > extra calls to service-disable" rather than "why do we want to call
> >> > service-disable" as I initially assumed. Is that right?
> >>
> >> Exactly. I assumed it was to limit the noise we'd be generating in doing so.
> >
> > Sort of - not just the noise, but the extra delay introduced by
> > calling service-disable, restarting nova-compute, and then calling
> > service-enable again when it succeeds.
>
> Ok, but restarting nova-compute is not optional and the bits that are
> optional are all but completely asynchronous* - so the overhead should
> be negligible.
>
> * Like most API calls, they are Ack'd when the request has been
> received, not processed.
Yes, fair points.
> >> >> > The masakari folks have a lot of operational experience in this space,
> >> >> > and they found that this was enough of a problem to justify calling
> >> >> > nova service-disable whenever the failure is detected.
> >> >>
> >> >> If you really want it whenever the failure is detected, call it from
> >> >> the monitor operation that finds it broken.
> >> >
> >> > Hmm, that appears to violate what I assume would be a fundamental
> >> > design principle of Pacemaker: that the "monitor" action never changes
> >> > the system's state (assuming there are no Heisenberg-like side effects
> >> > of monitoring, of course).
> >>
> >> That has traditionally been considered a good idea, and in the vast
> >> majority of cases I still think it is, but it's also a
> >> guideline that has been broken because there is no other way for the
> >> agent to work *cough* rabbit *cough*.
> >>
> >> In this specific case, I think it could be forgivable because you're
> >> not strictly altering the service but something that sits in front of
> >> it. start/stop/monitor would all continue to do TheRightThing(tm).
> >>
> >> > I guess you could argue that in this case,
> >> > the nova server's internal state could be considered outside the
> >> > system which Pacemaker is managing.
> >>
> >> Right.
> >
> > Well, if you're OK with bending the rules like this then that's good
> > enough for me to say we should at least try it :)
>
> I still say you shouldn't only do it on error.
When else should it be done?
IIUC, disabling/enabling the service is independent of the up/down
state which nova tracks automatically, and which, based on slightly
more than a skim of the code, depends on the state of the RPC
layer.
> > But how would you avoid repeated consecutive invocations of "nova
> > service-disable" when the monitor action fails, and ditto for "nova
> > service-enable" when it succeeds?
>
> I don't think you can. Not ideal but I'd not have thought a deal breaker.
Sounds like a massive deal-breaker to me! With op monitor
interval="10s" and 100 compute nodes, that would mean 10 pointless
calls to nova-api every second. Am I missing something?
Also I don't see any benefit to moving the API calls from start/stop
actions to the monitor action. If there's a failure, Pacemaker will
invoke the stop action, so we can do service-disable there. If the
start action is invoked and we successfully initiate startup of
nova-compute, the RA can undo any service-disable it previously did
(although it should not reverse a service-disable done elsewhere,
e.g. manually by the cloud operator).
> > Earlier in this thread I proposed
> > the idea of a tiny temporary file in /run which tracks the last known
> > state and optimizes away the consecutive invocations, but IIRC you
> > were against that.
>
> I'm generally not a fan, but sometimes state files are a necessity.
> Just make sure you think through what a missing file might mean.
Sure. A missing file would mean the RA's never called service-disable
before, which means that it shouldn't call service-enable on startup.
> Unless.... use the state file to store the date at which the last
> start operation occurred?
>
> If we're calling stop() and date - start_date > threshold, then, if
> you must, be optimistic, skip service-disable and assume we'll get
> started again soon.
>
> Otherwise if we're calling stop() and date - start_date <= threshold,
> always call service-disable because we're in a restart loop which is
> not worth optimising for.
>
> ( And always call service-enable at start() )
>
> No Pacemaker feature or Beekhof approval required :-)
Hmm ... it's possible I just don't understand this proposal fully,
but it sounds a bit woolly to me, e.g. how would you decide a suitable
threshold?
I think I preferred your other suggestion of just skipping the
optimization, i.e. calling service-disable on the first stop, and
service-enable on (almost) every start.
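For concreteness, my reading of the threshold idea is something like the
following (untested sketch; the helper and file names are made up, and
$threshold is precisely the value I don't know how to pick):

```shell
# Sketch of the proposed heuristic.  $start_stamp_file would be
# written with `date +%s` by start(); $threshold (seconds) is the
# value I don't know how to choose sensibly.
threshold=${threshold:-120}

# Succeeds if the last start was recent enough that we're probably
# in a restart loop, in which case stop() should always call
# service-disable rather than optimistically skipping it.
in_restart_loop() {
    local last_start now
    [ -e "$start_stamp_file" ] || return 1
    last_start=$(cat "$start_stamp_file")
    now=$(date +%s)
    [ $(( now - last_start )) -le "$threshold" ]
}
```

stop() would then call service-disable whenever in_restart_loop
succeeds, and optimistically skip it otherwise.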
[snipped]
> > OK, so by "unnecessary recovery optimization of an unrecoverable
> > system" you meant "unnecessary recovery optimization of a system which
> > we already tried and failed to recover",
>
> right
>
> > whereas I thought you meant
> > "unnecessary recovery optimization of a failed system which we haven't
> > tried to recover yet". But the optimization we are talking about
> > (i.e. skipping "nova service-disable") would only happen during
> > recovery attempts, before we finally give up, so it only applies to
> > the latter meaning, not the former.
>
> Strictly speaking, I would disagree.
>
> The first time we call stop() is the only point at which the "we
> haven't tried to recover yet" claus is true.
>
> Yet even then we do not and can not know if start() will ever be
> called, nor if it would be here or on another node, nor if any future
> invocation will succeed. So there is no real justification for the
> existence of an optimistic code path in stop() - which is why I keep
> banging on about correctness trumping optimisation.
That makes sense.
> [This is the point I came up with the (date - start_date) stuff above]
>
> If you're hell bent on some kind of first-time optimisation though,
No, I'm not. As I said, I'm happy to try this. So, in untested
pseudo code:
stop () {
    # "Best effort" attempt to tell nova the service is
    # unavailable; we cannot rely on this to succeed (since if the
    # stop is due to a failure, that failure might also be
    # creating issues with nova-api), so it has to be
    # asynchronous.
    #
    # N.B. At this point the service might have already been
    # disabled externally, e.g. manually by the cloud operator for
    # planned maintenance.
    attempt_nova_service_disable

    ensure_nova_compute_stopped
}

start () {
    if ! start_nova_compute; then
        # Start-up failed, so we don't want to re-enable.
        return
    fi

    # In contrast to attempt_nova_service_disable, this has to be
    # synchronous, since if we can't enable the service, there's
    # no point running it and pretending everything's OK.
    #
    # FIXME: we don't want to do this if something else other than
    # the stop() action previously called service-disable.
    ensure_nova_service_enabled
}

monitor () {
    # Here some application-level probe is better than a PID check.
}
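As for what that application-level probe might look like: one cheap
option would be to verify not just that the process exists, but that it
still holds its connection to the message broker. This is only a
sketch; the OCF return codes are standard, but the ss parsing, the
$pid_file variable (assumed to be written by start()), and the
assumption that the broker listens on the default rabbit port 5672 are
all mine and would need adjusting per deployment:

```shell
# Defaults normally provided by ocf-shellfuncs; repeated here so the
# sketch is self-contained.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"

monitor () {
    local pid
    # Cheap liveness check: is the process running at all?
    # $pid_file is assumed to have been written by start().
    pid=$(cat "$pid_file" 2>/dev/null) || return "$OCF_NOT_RUNNING"
    kill -0 "$pid" 2>/dev/null || return "$OCF_NOT_RUNNING"

    # Application-level check: does nova-compute still hold an
    # established connection to the AMQP broker?  5672 is the
    # default rabbit port - an assumption; adjust for your
    # deployment.
    if ss -tnp 2>/dev/null | grep ":5672 " | grep -q "pid=$pid,"; then
        return "$OCF_SUCCESS"
    fi
    return "$OCF_ERR_GENERIC"
}
```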
As per the FIXME, one remaining problem is dealing with this kind of
scenario:
- Cloud operator notices SMART warnings on the compute node
which is not yet causing hard failures but signifies that the
hard disk might die soon.
- Operator manually runs "nova service-disable" with the intention
of doing some maintenance soon, i.e. live-migrating instances away
and replacing the dying hard disk.
- Before the operator gracefully shuts down nova-compute, an I/O
error from the disk causes nova-compute to fail.
- Pacemaker invokes the monitor action which spots the failure.
- Pacemaker invokes the stop action which runs service-disable.
- Pacemaker attempts to restart nova-compute by invoking the start
action. Since the disk failure is currently intermittent, we
get (un)lucky and nova-compute starts fine.
Then it calls service-enable - BAD! This is now overriding the
cloud operator's manual request for the service to be disabled.
If we're really unlucky, nova-scheduler will now start up new VMs
on the node, even though the hard disk is dying.
However I can't see a way to defend against this :-/
Initially I thought that keeping track of when service-disable is
called by the RA would help distinguish from when it is called by
the cloud operator (or someone else), e.g.:
stop () {
    if ! [ -e "$service_disabled_state_file" ]; then
        # "Best effort" attempt to tell nova the service is
        # unavailable; we cannot rely on this to succeed (since
        # if the stop is due to a failure, that failure might also
        # be creating issues with nova-api), so it has to be
        # asynchronous.
        #
        # N.B. At this point the service might have already
        # been disabled externally, e.g. manually by the cloud
        # operator for planned maintenance.
        attempt_nova_service_disable

        # Optimistically assume the service-disable succeeded,
        # so that even if something went wrong, we make sure
        # to do a service-enable when we restart it.
        touch "$service_disabled_state_file"
    fi

    ensure_nova_compute_stopped
}

start () {
    if ! start_nova_compute; then
        # Start-up failed, so we don't want to re-enable.
        return
    fi

    if [ -e "$service_disabled_state_file" ]; then
        # In contrast to attempt_nova_service_disable, this
        # has to be synchronous, since if we can't enable the
        # service, there's no point running it and pretending
        # everything's OK.
        #
        # FIXME: we don't want to do this if something else
        # other than the stop() action previously called
        # service-disable.
        ensure_nova_service_enabled
        rm "$service_disabled_state_file"
    fi
}

monitor () {
    # Here some application-level probe is better than a PID check.
}
But with further thought, I'm not sure that state file brings any
benefit: it would only optimize away nova-api calls if start or stop
were called several times in a row, and that won't happen.
Another problem is that if a stop was rapidly followed by a start
(which we can expect with, say, migration-threshold=1), the
asynchronous attempt_nova_service_disable might not complete until
after the synchronous ensure_nova_service_enabled completed, in which
case we'd have a running nova-compute which was accidentally disabled.
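If the asynchronous disable stays, one way to close that race (again
just a sketch; all helper names here are mine) would be to record the
PID of the backgrounded service-disable and have the enable path wait
for it to finish before proceeding:

```shell
disable_pid_file=${disable_pid_file:-/run/nova-compute-ra.disable.pid}

attempt_nova_service_disable() {
    # Fire and forget, but remember the PID so a subsequent start()
    # can serialise against the in-flight disable.
    nova service-disable "$(hostname)" nova-compute &
    echo $! > "$disable_pid_file"
}

wait_for_pending_disable() {
    # The backgrounded disable was forked by a previous stop()
    # invocation, i.e. a different process, so the shell's wait
    # builtin won't work here; poll for the PID to exit instead.
    local pid
    pid=$(cat "$disable_pid_file" 2>/dev/null) || return 0
    while kill -0 "$pid" 2>/dev/null; do
        sleep 1
    done
    rm -f "$disable_pid_file"
}

ensure_nova_service_enabled() {
    wait_for_pending_disable
    nova service-enable "$(hostname)" nova-compute
}
```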
Thoughts?