[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?
aspiers at suse.com
Fri Jun 24 12:41:09 CEST 2016
Andrew Beekhof <abeekhof at redhat.com> wrote:
> On Fri, Jun 24, 2016 at 1:01 AM, Adam Spiers <aspiers at suse.com> wrote:
> > Andrew Beekhof <abeekhof at redhat.com> wrote:
> >> > Well, if you're OK with bending the rules like this then that's good
> >> > enough for me to say we should at least try it :)
> >> I still say you shouldn't only do it on error.
> > When else should it be done?
> I was thinking whenever a stop() happens.
OK, seems we're finally agreed :)
> > IIUC, disabling/enabling the service is independent of the up/down
> > state which nova tracks automatically, and which based on slightly
> > more than a skim of the code, is dependent on the state of the RPC
> > layer.
> >> > But how would you avoid repeated consecutive invocations of "nova
> >> > service-disable" when the monitor action fails, and ditto for "nova
> >> > service-enable" when it succeeds?
> >> I don't think you can. Not ideal but I'd not have thought a deal breaker.
> > Sounds like a massive deal-breaker to me! With op monitor
> > interval="10s" and 100 compute nodes, that would mean 10 pointless
> > calls to nova-api every second. Am I missing something?
> I was thinking you would only call it for the "I detected a failure
> case" and service-enable would still be on start().
> So the number of pointless calls per second would be capped at one
> tenth of the number of failed compute nodes.
> One would hope that all of them weren't dead.
Oh OK - yeah that wouldn't be nearly as bad.
> > Also I don't see any benefit to moving the API calls from start/stop
> > actions to the monitor action. If there's a failure, Pacemaker will
> > invoke the stop action, so we can do service-disable there.
> I agree. Doing it unconditionally at stop() is my preferred option, I
> was only trying to provide a path that might be close to the behaviour
> you were looking for.
> > If the
> > start action is invoked and we successfully initiate startup of
> > nova-compute, the RA can undo any service-disable it previously did
> > (although it should not reverse a service-disable done elsewhere,
> > e.g. manually by the cloud operator).
Trying to adjust to this new sensation of agreement ;-)
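For the record, the behaviour we seem to have converged on could be sketched roughly like this (a sketch only, not the real NovaCompute RA; the STATEFILE path and the nova_* wrapper names are my own placeholders):

```shell
# Sketch of the agreed behaviour: stop() disables unconditionally and
# records that the RA itself did it; start() only re-enables when that
# marker is present, so a disable done manually by the operator is
# never reversed.  STATEFILE and the nova_* wrappers are illustrative.
STATEFILE="${STATEFILE:-/tmp/nova-compute-ra.disabled}"

nova_disable() { nova service-disable "$(hostname)" nova-compute; }
nova_enable()  { nova service-enable  "$(hostname)" nova-compute; }

ra_stop() {
    # ... stop the nova-compute service itself here ...
    nova_disable && touch "$STATEFILE"
}

ra_start() {
    # Only undo a disable this RA performed.
    if [ -f "$STATEFILE" ]; then
        nova_enable && rm -f "$STATEFILE"
    fi
    # ... start the nova-compute service itself here ...
}
```
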
> >> > Earlier in this thread I proposed
> >> > the idea of a tiny temporary file in /run which tracks the last known
> >> > state and optimizes away the consecutive invocations, but IIRC you
> >> > were against that.
> >> I'm generally not a fan, but sometimes state files are a necessity.
> >> Just make sure you think through what a missing file might mean.
> > Sure. A missing file would mean the RA's never called service-disable
> > before,
> And that is why I generally don't like state files.
> The default location for state files doesn't persist across reboots.
> t1. stop (ie. disable)
> t2. reboot
> t3. start with no state file
> t4. WHY WON'T NOVA USE THE NEW COMPUTE NODE, STUPID CLUSTERS
Well, then we simply put the state file somewhere which does persist
across reboots.
> > which means that it shouldn't call service-enable on startup.
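That interaction is the crux of Andrew's objection, and it can be sketched like so (paths and names are illustrative; $HA_RSCTMP is the conventional OCF state area, typically tmpfs-backed):

```shell
# Sketch of the failure mode in the t1-t4 timeline above: a marker
# recorded at t1 on tmpfs is gone after the reboot at t2, so at t3
# start() sees no marker, skips service-enable, and the compute node
# stays disabled in nova (t4).  STATE_DIR/MARKER are assumptions.
STATE_DIR="${STATE_DIR:-/run/resource-agents}"   # tmpfs: wiped at boot
MARKER="$STATE_DIR/nova-compute.disabled"

start_would_enable() {
    # start() only re-enables if the RA's own marker is present
    if [ -f "$MARKER" ]; then echo yes; else echo no; fi
}
```

Putting the marker under /var/lib (or anywhere else that survives a reboot) sidesteps this entirely.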
> >> Unless.... use the state file to store the date at which the last
> >> start operation occurred?
> >> If we're calling stop() and date - start_date > threshold, then, if
> >> you must, be optimistic, skip service-disable and assume we'll get
> >> started again soon.
> >> Otherwise if we're calling stop() and date - start_date <= threshold,
> >> always call service-disable because we're in a restart loop which is
> >> not worth optimising for.
> >> ( And always call service-enable at start() )
> >> No Pacemaker feature or Beekhof approval required :-)
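As I understand it, the timestamp heuristic would look something like this (a sketch only; the state-file path, its format, and the 300-second threshold are all arbitrary assumptions on my part):

```shell
# Sketch of the timestamp heuristic, not real RA code.
# STATEFILE stores the epoch time of the last successful start.
STATEFILE="${STATEFILE:-/tmp/nova-compute.last_start}"
THRESHOLD="${THRESHOLD:-300}"   # seconds; an arbitrary choice

record_start() {
    # called from start(), which also always calls service-enable
    date +%s > "$STATEFILE"
}

should_disable_on_stop() {
    # no record of a start: play it safe and disable
    [ -f "$STATEFILE" ] || return 0
    now=$(date +%s)
    last_start=$(cat "$STATEFILE")
    if [ $(( now - last_start )) -le "$THRESHOLD" ]; then
        return 0   # started recently: restart loop, always disable
    else
        return 1   # long-running: be optimistic, skip service-disable
    fi
}
```
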
> > Hmm ... it's possible I just don't understand this proposal fully,
> > but it sounds a bit woolly to me, e.g. how would you decide a suitable
> > threshold?
> roll a die?
> > I think I preferred your other suggestion of just skipping the
> > optimization, i.e. calling service-disable on the first stop, and
> > service-enable on (almost) every start.
> good :)
> And the use of force-down from your subsequent email sounds excellent
OK great! We finally got there :-) Now I guess I just have to write
the spec and the actual code ;-)