[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Thu Jun 23 19:35:43 EDT 2016

On Fri, Jun 24, 2016 at 1:01 AM, Adam Spiers <aspiers at suse.com> wrote:
> Andrew Beekhof <abeekhof at redhat.com> wrote:

>> > Well, if you're OK with bending the rules like this then that's good
>> > enough for me to say we should at least try it :)
>>
>> I still say you shouldn't only do it on error.
>
> When else should it be done?

I was thinking whenever a stop() happens.

> IIUC, disabling/enabling the service is independent of the up/down
> state which nova tracks automatically, and which based on slightly
> more than a skim of the code, is dependent on the state of the RPC
> layer.
>
>> > But how would you avoid repeated consecutive invocations of "nova
>> > service-disable" when the monitor action fails, and ditto for "nova
>> > service-enable" when it succeeds?
>>
>> I don't think you can. Not ideal but I'd not have thought a deal breaker.
>
> Sounds like a massive deal-breaker to me!  With op monitor
> interval="10s" and 100 compute nodes, that would mean 10 pointless
> calls to nova-api every second.  Am I missing something?

I was thinking you would only call it for the "I detected a failure"
case, and service-enable would still be on start().
So the number of pointless calls per second would be capped at one
tenth of the number of failed compute nodes.

One would hope that all of them weren't dead.
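
To put rough numbers on that, here is a minimal sketch of the
arithmetic. It assumes each node that calls nova-api does so exactly
once per monitor interval; the function name and the "3 failed nodes"
figure are made up for illustration.

```python
def api_calls_per_second(nodes_calling: int, monitor_interval_s: float) -> float:
    """Steady-state nova-api call rate, assuming one call per node per interval."""
    if monitor_interval_s <= 0:
        raise ValueError("monitor interval must be positive")
    return nodes_calling / monitor_interval_s


# Every monitor calls nova-api: 100 compute nodes, interval="10s".
all_nodes = api_calls_per_second(100, 10)

# Only monitors that detected a failure call nova-api
# (3 failed nodes is an arbitrary example).
failed_only = api_calls_per_second(3, 10)

print(all_nodes, failed_only)
```

So the rate scales with the number of failed nodes, not with the size
of the cluster.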

>
> Also I don't see any benefit to moving the API calls from start/stop
> actions to the monitor action.  If there's a failure, Pacemaker will
> invoke the stop action, so we can do service-disable there.

I agree. Doing it unconditionally at stop() is my preferred option, I
was only trying to provide a path that might be close to the behaviour
you were looking for.

> If the
> start action is invoked and we successfully initiate startup of
> nova-compute, the RA can undo any service-disable it previously did
> (although it should not reverse a service-disable done elsewhere,
> e.g. manually by the cloud operator).

Agree

>
>> > Earlier in this thread I proposed
>> > the idea of a tiny temporary file in /run which tracks the last known
>> > state and optimizes away the consecutive invocations, but IIRC you
>> > were against that.
>>
>> I'm generally not a fan, but sometimes state files are a necessity.
>> Just make sure you think through what a missing file might mean.
>
> Sure.  A missing file would mean the RA's never called service-disable
> before,

And that is why I generally don't like state files.
The default location for state files doesn't persist across reboots.

t1. stop (i.e. disable)
t2. reboot
t3. start with no state file
t4. WHY WON'T NOVA USE THE NEW COMPUTE NODE STUPID CLUSTERS
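
The same sequence as a small simulation, in case it helps. This is only
a sketch: the temporary directory stands in for /run, the file name is
hypothetical, and the "calls" list stands in for real nova-api
invocations. It assumes the RA treats a missing file as "I never
disabled this service".

```python
import os
import tempfile


def stop(state_file: str, calls: list) -> None:
    """stop(): disable the service and record that this RA did it."""
    calls.append("service-disable")
    with open(state_file, "w") as f:
        f.write("disabled-by-ra\n")


def start(state_file: str, calls: list) -> None:
    """start(): re-enable only if the state file says this RA disabled it."""
    if os.path.exists(state_file):
        calls.append("service-enable")
        os.remove(state_file)


def simulate_reboot(state_file: str) -> None:
    """A non-persistent location loses its contents on reboot."""
    if os.path.exists(state_file):
        os.remove(state_file)


with tempfile.TemporaryDirectory() as run_dir:  # stand-in for /run
    state_file = os.path.join(run_dir, "nova-compute.disabled")
    calls = []
    stop(state_file, calls)        # t1
    simulate_reboot(state_file)    # t2
    start(state_file, calls)       # t3
    print(calls)                   # t4: no service-enable was ever issued
```

The service ends up disabled in nova with nothing left on disk to say
the RA was responsible.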

> which means that it shouldn't call service-enable on startup.
>
>> Unless.... use the state file to store the date at which the last
>> start operation occurred?
>>
>> If we're calling stop() and date - start_date > threshold, then, if
>> you must, be optimistic, skip service-disable and assume we'll get
>> started again soon.
>>
>> Otherwise if we're calling stop() and date - start_date <= threshold,
>> always call service-disable because we're in a restart loop which is
>> not worth optimising for.
>>
>> ( And always call service-enable at start() )
>>
>> No Pacemaker feature or Beekhof approval required :-)
>
> Hmm ...  it's possible I just don't understand this proposal fully,
> but it sounds a bit woolly to me, e.g. how would you decide a suitable
> threshold?

roll a dice?
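
For what it's worth, the decision itself is tiny whatever threshold you
pick. A sketch, assuming timestamps are seconds since the epoch; the
function name is made up and the 300 second threshold is arbitrary, not
a recommendation.

```python
def should_disable_on_stop(now: float, start_date: float, threshold_s: float) -> bool:
    """Decide whether stop() should call service-disable.

    Uptime above the threshold: be optimistic, skip service-disable and
    assume we'll get started again soon.
    Uptime at or below the threshold: probably a restart loop, so always
    call service-disable.
    """
    return (now - start_date) <= threshold_s


THRESHOLD_S = 300  # arbitrary value for illustration only

# Started an hour ago, stopping now: skip service-disable.
long_running = should_disable_on_stop(now=3600, start_date=0, threshold_s=THRESHOLD_S)

# Started 20 seconds ago, stopping now: call service-disable.
restart_loop = should_disable_on_stop(now=20, start_date=0, threshold_s=THRESHOLD_S)

print(long_running, restart_loop)
```

start() would still call service-enable unconditionally.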

> I think I preferred your other suggestion of just skipping the
> optimization, i.e. calling service-disable on the first stop, and
> service-enable on (almost) every start.

good :)

And the use of force-down from your subsequent email sounds excellent.