[ClusterLabs] Antw: Re: FR: send failcount to OCF RA start/stop actions
Adam Spiers
aspiers at suse.com
Mon May 23 21:45:32 UTC 2016
Ken Gaillot <kgaillot at redhat.com> wrote:
> On 05/20/2016 10:40 AM, Adam Spiers wrote:
> > Ken Gaillot <kgaillot at redhat.com> wrote:
> >> Just musing a bit ... on-fail + migration-threshold could have been
> >> designed to be more flexible:
> >>
> >> hard-fail-threshold: When an operation fails this many times, the
> >> cluster will consider the failure to be a "hard" failure. Until this
> >> many failures, the cluster will try to recover the resource on the same
> >> node.
> >
> > How is this different to migration-threshold, other than in name?
> >
> >> hard-fail-action: What to do when the operation reaches
> >> hard-fail-threshold ("ban" would work like current "restart" i.e. move
> >> to another node, and ignore/block/stop/standby/fence would work the same
> >> as now)
> >
> > And I'm not sure I understand how this is different to / more flexible
> > than what we can do with on-fail now?
> >
> >> That would allow fence etc. to be done only after a specified number of
> >> retries. Ah, hindsight ...
> >
> > Isn't that possible now, e.g. with migration-threshold=3 and
> > on-fail=fence? I feel like I'm missing something.
>
> migration-threshold only applies when on-fail=restart. If on-fail=fence
> or something else, that action always applies after the first failure.
*sound of penny dropping*
Ahah! Thanks, yes that's what I was missing :-)
> So hard-fail-threshold would indeed be the same as migration-threshold,
> but applied to all actions (and would be renamed, since the resource
> won't migrate in the other cases).
Gotcha.
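To make the current semantics concrete for anyone finding this in the
archives, here is a minimal crm shell sketch (the resource names, the
Dummy RA and the intervals are just placeholders):

    # retried on the same node; banned from it after 3 failures
    primitive rsc-retry ocf:heartbeat:Dummy \
        op monitor interval=10s on-fail=restart \
        meta migration-threshold=3

    # fenced on the *first* monitor failure; migration-threshold is
    # not consulted when on-fail is anything other than restart
    primitive rsc-fence ocf:heartbeat:Dummy \
        op monitor interval=10s on-fail=fence

So today there is no way to say "retry locally N times, then fence",
which is what hard-fail-threshold/hard-fail-action would have allowed.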
> >>> - neutron-l3-agent RA detects that the agent is unhealthy, and iff it
> >>> fails to restart it, we want to trigger migration of any routers on
> >>> that l3-agent to a healthy l3-agent. Currently we wait for the
> >>> connection between the agent and the neutron server to time out,
> >>> which is unpleasantly slow. This case is more of a requirement than
> >>> an optimization, because we really don't want to migrate routers to
> >>> another node unless we have to, because a) it takes time, and b) it is
> >>> disruptive enough that we don't want to have to migrate them back
> >>> soon after if we discover we can successfully recover the unhealthy
> >>> l3-agent.
> >>>
> >>> - Remove a failed backend from a haproxy-fronted service if
> >>> it can't be restarted.
> >>>
> >>> - Notify any other service (OpenStack or otherwise) where the failing
> >>> local resource is a backend worker for some central service. I
> >>> guess ceilometer, cinder, mistral etc. are all potential
> >>> examples of this.
> >
> > Any thoughts on the sanity of these?
>
> Beyond my expertise. But sounds reasonable.
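For the record, the router evacuation in the first example is roughly
the following with the old python-neutronclient CLI (the agent and
router UUIDs are placeholders):

    # routers currently scheduled on the unhealthy agent
    neutron router-list-on-l3-agent <failed-agent-uuid>
    # reschedule each one onto a healthy agent
    neutron l3-agent-router-remove <failed-agent-uuid> <router-uuid>
    neutron l3-agent-router-add <healthy-agent-uuid> <router-uuid>

and the haproxy example would be something like disabling the backend
server via the admin socket (assuming a stats socket configured with
"level admin"; the backend/server names are made up):

    echo "disable server be_example/node1" | socat stdio /var/run/haproxy.sock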
We should probably migrate this part of the discussion to
openstack-dev ...