[ClusterLabs] Antw: Re: FR: send failcount to OCF RA start/stop actions
Ken Gaillot
kgaillot at redhat.com
Mon May 23 17:04:51 UTC 2016
On 05/20/2016 10:40 AM, Adam Spiers wrote:
> Ken Gaillot <kgaillot at redhat.com> wrote:
>> Just musing a bit ... on-fail + migration-threshold could have been
>> designed to be more flexible:
>>
>> hard-fail-threshold: When an operation fails this many times, the
>> cluster will consider the failure to be a "hard" failure. Until this
>> many failures, the cluster will try to recover the resource on the same
>> node.
>
> How is this different to migration-threshold, other than in name?
>
>> hard-fail-action: What to do when the operation reaches
>> hard-fail-threshold ("ban" would work like current "restart" i.e. move
>> to another node, and ignore/block/stop/standby/fence would work the same
>> as now)
>
> And I'm not sure I understand how this is different to / more flexible
> than what we can do with on-fail now?
>
>> That would allow fence etc. to be done only after a specified number of
>> retries. Ah, hindsight ...
>
> Isn't that possible now, e.g. with migration-threshold=3 and
> on-fail=fence? I feel like I'm missing something.
migration-threshold only applies when on-fail=restart. If on-fail is set
to fence or anything else, that action is always taken after the first
failure. So hard-fail-threshold would indeed be the same as
migration-threshold, but it would apply to every on-fail action (and it
would need a different name, since the resource won't migrate in the
other cases).
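
To make the difference concrete, here is a toy Python model of the two
schemes as described above. It is only a sketch, not Pacemaker code: the
function names and the "recover-in-place" label are invented for
illustration, and a threshold of None is an assumed stand-in for "no limit".

```python
from typing import Optional


def current_response(on_fail: str, migration_threshold: Optional[int],
                     failcount: int) -> str:
    """Toy model of today's behavior as described in this thread.

    The threshold is consulted only when on_fail is "restart"; any other
    on_fail value is applied as soon as the first failure occurs.
    """
    if on_fail != "restart":
        return on_fail
    if migration_threshold is not None and failcount >= migration_threshold:
        return "ban"  # move the resource to another node
    return "recover-in-place"  # retry on the same node


def proposed_response(hard_fail_threshold: int, hard_fail_action: str,
                      failcount: int) -> str:
    """Toy model of the hypothetical hard-fail-threshold/hard-fail-action.

    Recovery is attempted on the same node until the threshold is reached,
    whatever the configured action is.
    """
    if failcount >= hard_fail_threshold:
        return hard_fail_action
    return "recover-in-place"


if __name__ == "__main__":
    # on-fail=fence with migration-threshold=3: the threshold is ignored.
    print(current_response("fence", 3, failcount=1))
    # hard-fail-action=fence with hard-fail-threshold=3: retries come first.
    print(proposed_response(3, "fence", failcount=1))
    print(proposed_response(3, "fence", failcount=3))
```

In this model, hard-fail-action="ban" with hard-fail-threshold=3 behaves
like on-fail=restart with migration-threshold=3.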
>>> - neutron-l3-agent RA detects that the agent is unhealthy, and iff it
>>> fails to restart it, we want to trigger migration of any routers on
>>> that l3-agent to a healthy l3-agent. Currently we wait for the
>>> connection between the agent and the neutron server to time out,
>>> which is unpleasantly slow. This case is more of a requirement than
>>> an optimization, because we really don't want to migrate routers to
>>> another node unless we have to, because a) it takes time, and b) is
>>> disruptive enough that we don't want to have to migrate them back
>>> soon after if we discover we can successfully recover the unhealthy
>>> l3-agent.
>>>
>>> - Remove a failed backend from an haproxy-fronted service if
>>> it can't be restarted.
>>>
>>> - Notify any other service (OpenStack or otherwise) where the failing
>>> local resource is a backend worker for some central service. I
>>> guess ceilometer, cinder, mistral etc. are all potential
>>> examples of this.
>
> Any thoughts on the sanity of these?
That's beyond my expertise, but they sound reasonable.