[ClusterLabs] Antw: Re: FR: send failcount to OCF RA start/stop actions
Adam Spiers
aspiers at suse.com
Mon May 23 21:45:32 UTC 2016
Ken Gaillot <kgaillot at redhat.com> wrote:
> On 05/20/2016 10:40 AM, Adam Spiers wrote:
> > Ken Gaillot <kgaillot at redhat.com> wrote:
> >> Just musing a bit ... on-fail + migration-threshold could have been
> >> designed to be more flexible:
> >>
> >> hard-fail-threshold: When an operation fails this many times, the
> >> cluster will consider the failure to be a "hard" failure. Until this
> >> many failures, the cluster will try to recover the resource on the same
> >> node.
> >
> > How is this different to migration-threshold, other than in name?
> >
> >> hard-fail-action: What to do when the operation reaches
> >> hard-fail-threshold ("ban" would work like current "restart" i.e. move
> >> to another node, and ignore/block/stop/standby/fence would work the same
> >> as now)
> >
> > And I'm not sure I understand how this is different to / more flexible
> > than what we can do with on-fail now?
> >
> >> That would allow fence etc. to be done only after a specified number of
> >> retries. Ah, hindsight ...
> >
> > Isn't that possible now, e.g. with migration-threshold=3 and
> > on-fail=fence? I feel like I'm missing something.
>
> migration-threshold only applies when on-fail=restart. If on-fail=fence
> or something else, that action always applies after the first failure.
*sound of penny dropping*
Ahah! Thanks, yes that's what I was missing :-)
> So hard-fail-threshold would indeed be the same as migration-threshold,
> but applied to all actions (and would be renamed, since the resource
> won't migrate in the other cases).
Gotcha.
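To make the current semantics concrete for anyone finding this in the
archives, here is a minimal crm shell sketch (the resource names, the
Dummy RA and the intervals are just placeholders):

    # retried on the same node; banned from it after 3 failures
    primitive rsc-retry ocf:heartbeat:Dummy \
        op monitor interval=10s on-fail=restart \
        meta migration-threshold=3

    # fenced on the *first* monitor failure; migration-threshold is
    # not consulted when on-fail is anything other than restart
    primitive rsc-fence ocf:heartbeat:Dummy \
        op monitor interval=10s on-fail=fence

So today there is no way to say "retry locally N times, then fence",
which is what hard-fail-threshold/hard-fail-action would have allowed.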
> >>> - neutron-l3-agent RA detects that the agent is unhealthy, and iff it
> >>> fails to restart it, we want to trigger migration of any routers on
> >>> that l3-agent to a healthy l3-agent. Currently we wait for the
> >>> connection between the agent and the neutron server to time out,
> >>> which is unpleasantly slow. This case is more of a requirement than
> >>> an optimization, because we really don't want to migrate routers to
> >>> another node unless we have to, because a) it takes time, and b) it is
> >>> disruptive enough that we don't want to have to migrate them back
> >>> soon after if we discover we can successfully recover the unhealthy
> >>> l3-agent.
> >>>
> >>> - Remove a failed backend from a haproxy-fronted service if
> >>> it can't be restarted.
> >>>
> >>> - Notify any other service (OpenStack or otherwise) where the failing
> >>> local resource is a backend worker for some central service. I
> >>> guess ceilometer, cinder, mistral etc. are all potential
> >>> examples of this.
> >
> > Any thoughts on the sanity of these?
>
> Beyond my expertise. But sounds reasonable.
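For the record, the router evacuation in the first example is roughly
the following with the old python-neutronclient CLI (the agent and
router UUIDs are placeholders):

    # routers currently scheduled on the unhealthy agent
    neutron router-list-on-l3-agent <failed-agent-uuid>
    # reschedule each one onto a healthy agent
    neutron l3-agent-router-remove <failed-agent-uuid> <router-uuid>
    neutron l3-agent-router-add <healthy-agent-uuid> <router-uuid>

and the haproxy example would be something like disabling the backend
server via the admin socket (assuming a stats socket configured with
"level admin"; the backend/server names are made up):

    echo "disable server be_example/node1" | socat stdio /var/run/haproxy.sock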
We should probably migrate this part of the discussion to
openstack-dev ...