[ClusterLabs] Antw: Re: FR: send failcount to OCF RA start/stop actions

Mon May 23 17:04:51 UTC 2016

On 05/20/2016 10:40 AM, Adam Spiers wrote:
> Ken Gaillot <kgaillot at redhat.com> wrote:
>> Just musing a bit ... on-fail + migration-threshold could have been
>> designed to be more flexible:
>>
>>   hard-fail-threshold: When an operation fails this many times, the
>> cluster will consider the failure to be a "hard" failure. Until this
>> many failures, the cluster will try to recover the resource on the same
>> node.
> 
> How is this different to migration-threshold, other than in name?
> 
>>   hard-fail-action: What to do when the operation reaches
>> hard-fail-threshold ("ban" would work like current "restart" i.e. move
>> to another node, and ignore/block/stop/standby/fence would work the same
>> as now)
> 
> And I'm not sure I understand how this is different to / more flexible
> than what we can do with on-fail now?
> 
>> That would allow fence etc. to be done only after a specified number of
>> retries. Ah, hindsight ...
> 
> Isn't that possible now, e.g. with migration-threshold=3 and
> on-fail=fence?  I feel like I'm missing something.

migration-threshold only applies when on-fail=restart. With on-fail=fence
or any other value, that action is taken on the very first failure,
regardless of migration-threshold.

So hard-fail-threshold would indeed be the same as migration-threshold,
but applied to all actions (and would be renamed, since the resource
won't migrate in the other cases).
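
To make the difference concrete, here is a rough sketch in Python. It is
only a toy model of the behaviour described above, not Pacemaker code:
the function names are made up for illustration, hard-fail-threshold and
hard-fail-action do not exist as options, and it assumes fail_count is
at least 1 and that the threshold is reached when fail_count equals it.

```python
# Toy model only; not how Pacemaker is implemented.
# Function names are hypothetical, and hard-fail-* are proposed options
# that do not exist in Pacemaker.

def current_response(on_fail, fail_count, migration_threshold):
    """Model of today's behaviour as described in this thread."""
    if on_fail == "restart":
        # migration-threshold is consulted only for on-fail=restart.
        if fail_count >= migration_threshold:
            return "ban"      # move the resource to another node
        return "restart"      # try to recover on the same node
    # Any other on-fail value is applied at the first failure.
    return on_fail


def proposed_response(hard_fail_action, fail_count, hard_fail_threshold):
    """Model of the hypothetical hard-fail-threshold/hard-fail-action pair."""
    if fail_count >= hard_fail_threshold:
        return hard_fail_action   # ban, ignore, block, stop, standby, fence
    return "restart"              # still a "soft" failure: recover locally


if __name__ == "__main__":
    for count in (1, 2, 3):
        print(count,
              current_response("fence", count, 3),
              proposed_response("fence", count, 3))
```

In this model, on-fail=fence with migration-threshold=3 fences at the
first failure, while hard-fail-action=fence with hard-fail-threshold=3
restarts locally twice and fences only at the third failure.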

>>> - neutron-l3-agent RA detects that the agent is unhealthy, and iff it
>>>   fails to restart it, we want to trigger migration of any routers on
>>>   that l3-agent to a healthy l3-agent.  Currently we wait for the
>>>   connection between the agent and the neutron server to time out,
>>>   which is unpleasantly slow.  This case is more of a requirement than
>>>   an optimization, because we really don't want to migrate routers to
>>>   another node unless we have to, because a) it takes time, and b) it is
>>>   disruptive enough that we don't want to have to migrate them back
>>>   soon after if we discover we can successfully recover the unhealthy
>>>   l3-agent.
>>>
>>> - Remove a failed backend from an haproxy-fronted service if
>>>   it can't be restarted.
>>>
>>> - Notify any other service (OpenStack or otherwise) where the failing
>>>   local resource is a backend worker for some central service.  I
>>>   guess ceilometer, cinder, mistral etc. are all potential
>>>   examples of this.
> 
> Any thoughts on the sanity of these?

That's beyond my expertise, but they sound reasonable.