[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Fri Jun 24 12:41:09 CEST 2016

Andrew Beekhof <abeekhof at redhat.com> wrote:
> On Fri, Jun 24, 2016 at 1:01 AM, Adam Spiers <aspiers at suse.com> wrote:
> > Andrew Beekhof <abeekhof at redhat.com> wrote:
> 
> >> > Well, if you're OK with bending the rules like this then that's good
> >> > enough for me to say we should at least try it :)
> >>
> >> I still say you shouldn't only do it on error.
> >
> > When else should it be done?
> 
> I was thinking whenever a stop() happens.

OK, seems we are agreed finally :)

> > IIUC, disabling/enabling the service is independent of the up/down
> > state which nova tracks automatically, and which, based on slightly
> > more than a skim of the code, is dependent on the state of the RPC
> > layer.
> >
> >> > But how would you avoid repeated consecutive invocations of "nova
> >> > service-disable" when the monitor action fails, and ditto for "nova
> >> > service-enable" when it succeeds?
> >>
> >> I don't think you can. Not ideal but I'd not have thought a deal breaker.
> >
> > Sounds like a massive deal-breaker to me!  With op monitor
> > interval="10s" and 100 compute nodes, that would mean 10 pointless
> > calls to nova-api every second.  Am I missing something?
> 
> I was thinking you would only call it for the "I detected a failure
> case" and service-enable would still be on start().
> So the number of pointless calls per second would be capped at one
> tenth of the number of failed compute nodes.
> 
> One would hope that all of them weren't dead.

Oh OK - yeah that wouldn't be nearly as bad.
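
Just to put numbers on it, here is a back-of-the-envelope sketch of the
two rates. The function name and the figure of 3 failed nodes are made
up for illustration; only the 100 nodes and the 10s interval come from
the discussion above.

```python
def calls_per_second(nodes: int, interval_s: float) -> float:
    """API calls per second if each of `nodes` makes one call per monitor interval."""
    return nodes / interval_s


# Every monitor run on every node calls nova-api: 100 nodes, 10s interval.
print(calls_per_second(100, 10))

# Only nodes with a detected failure call it; 3 failed nodes is an
# arbitrary example value, not a number from this thread.
print(calls_per_second(3, 10))
```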

> > Also I don't see any benefit to moving the API calls from start/stop
> > actions to the monitor action.  If there's a failure, Pacemaker will
> > invoke the stop action, so we can do service-disable there.
> 
> I agree. Doing it unconditionally at stop() is my preferred option, I
> was only trying to provide a path that might be close to the behaviour
> you were looking for.
> 
> > If the
> > start action is invoked and we successfully initiate startup of
> > nova-compute, the RA can undo any service-disable it previously did
> > (although it should not reverse a service-disable done elsewhere,
> > e.g. manually by the cloud operator).
> 
> Agree

Trying to adjust to this new sensation of agreement ;-)

> >> > Earlier in this thread I proposed
> >> > the idea of a tiny temporary file in /run which tracks the last known
> >> > state and optimizes away the consecutive invocations, but IIRC you
> >> > were against that.
> >>
> >> I'm generally not a fan, but sometimes state files are a necessity.
> >> Just make sure you think through what a missing file might mean.
> >
> > Sure.  A missing file would mean the RA's never called service-disable
> > before,
> 
> And that is why I generally don't like state files.
> The default location for state files doesn't persist across reboots.
> 
> t1. stop (ie. disable)
> t2. reboot
> t3. start with no state file
> t4. WHY WONT NOVA USE THE NEW COMPUTE NODE STUPID CLUSTERS

Well then we simply put the state file somewhere which does persist
across reboots.
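
Roughly what I have in mind, as a minimal sketch and not real RA code:
a marker file whose presence means "the RA called service-disable", so
a missing file means the RA should leave the service state alone on
start. The function names are hypothetical, and the caller would have
to choose a marker path on storage that survives a reboot; I am not
proposing a specific path here.

```python
from pathlib import Path


def mark_disabled_by_ra(marker: Path) -> None:
    """Record that the RA itself called service-disable (e.g. in stop)."""
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()


def disabled_by_ra(marker: Path) -> bool:
    """True only if the RA previously recorded its own service-disable.

    A missing marker is read as "the RA never disabled the service", so
    start should not reverse a disable done elsewhere (e.g. by an operator).
    """
    return marker.exists()


def clear_marker(marker: Path) -> None:
    """Forget the record once service-enable has been done in start."""
    try:
        marker.unlink()
    except FileNotFoundError:
        pass  # already gone; nothing to undo
```

This only gives the right answer at t3 in your timeline if the marker
really is still there after the reboot, which is the whole point of
moving it out of the default location.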

> > which means that it shouldn't call service-enable on startup.
> >
> >> Unless.... use the state file to store the date at which the last
> >> start operation occurred?
> >>
> >> If we're calling stop() and date - start_date > threshold, then, if
> >> you must, be optimistic, skip service-disable and assume we'll get
> >> started again soon.
> >>
> >> Otherwise if we're calling stop() and date - start_date <= threshold,
> >> always call service-disable because we're in a restart loop which is
> >> not worth optimising for.
> >>
> >> ( And always call service-enable at start() )
> >>
> >> No Pacemaker feature or Beekhof approval required :-)
> >
> > Hmm ...  it's possible I just don't understand this proposal fully,
> > but it sounds a bit woolly to me, e.g. how would you decide a suitable
> > threshold?
> 
> roll a dice?
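
For reference, this is how I read the threshold idea, as a sketch only.
The function and parameter names are mine, and the 60 second threshold
in the example is arbitrary, which is exactly my problem with it.

```python
def should_disable_on_stop(now: float, start_time: float, threshold_s: float) -> bool:
    """Decide whether stop() should call service-disable.

    If the service ran for longer than the threshold, be optimistic and
    skip the call; if it ran for the threshold or less, assume a restart
    loop and always disable. Times are seconds since the epoch.
    """
    uptime = now - start_time
    return uptime <= threshold_s


# Stopped 5s after the last start, with an arbitrary 60s threshold:
# looks like a restart loop, so disable.
print(should_disable_on_stop(now=1005.0, start_time=1000.0, threshold_s=60.0))

# Stopped an hour after the last start: skip the disable.
print(should_disable_on_stop(now=4600.0, start_time=1000.0, threshold_s=60.0))
```
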
> 
> > I think I preferred your other suggestion of just skipping the
> > optimization, i.e. calling service-disable on the first stop, and
> > service-enable on (almost) every start.
> 
> good :)
> 
> And the use of force-down from your subsequent email sounds excellent

OK great!  We finally got there :-)  Now I guess I just have to write
the spec and the actual code ;-)