[ClusterLabs] Antw: Re: FR: send failcount to OCF RA start/stop actions
Adam Spiers
aspiers at suse.com
Fri May 20 15:40:58 UTC 2016
Ken Gaillot <kgaillot at redhat.com> wrote:
> On 05/12/2016 06:21 AM, Adam Spiers wrote:
> > Ken Gaillot <kgaillot at redhat.com> wrote:
> >> On 05/10/2016 02:29 AM, Ulrich Windl wrote:
> >>>> Here is what I'm testing currently:
> >>>>
> >>>> - When the cluster recovers a resource, the resource agent's stop action
> >>>> will get a new variable, OCF_RESKEY_CRM_meta_recovery_left =
> >>>> migration-threshold - fail-count on the local node.
[snipped]
> > I'd prefer plural (OCF_RESKEY_CRM_meta_recoveries_left) but other than
> > that I think it's good. OCF_RESKEY_CRM_meta_retries_left is shorter;
> > not sure whether it's marginally worse or better though.
>
> I'm now leaning to restart_remaining (restarts_remaining would be just
> as good).
restarts_remaining would be better IMHO, given that there will often
be multiple restarts remaining.
[snipped]
> > OK, so the RA code would typically be something like this?
> >
> > if [ ${OCF_RESKEY_CRM_meta_retries_left:-0} = 0 ]; then
> >     # This is the final stop, so tell the external service
> >     # not to send any more work our way.
> >     disable_service
> > fi
>
> I'd use -eq :) but yes
Right, -eq is better style for numeric comparison :-)
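So, just to make sure I have the pattern right, this is roughly what I'm
picturing for the stop action -- a rough sketch only, assuming the variable
ends up being called restarts_remaining, that the usual ocf-shellfuncs are
sourced, and with stop_local_daemon / disable_service as placeholders for
the agent-specific bits:

    stop_action() {
        # stop_local_daemon is a placeholder for the agent's real stop logic.
        stop_local_daemon || return $OCF_ERR_GENERIC

        # 0 and unset both mean "no further restart is expected on this
        # node", so treat them the same, as discussed above.
        if [ "${OCF_RESKEY_CRM_meta_restarts_remaining:-0}" -eq 0 ]; then
            # Tell the external service not to send any more work our way.
            disable_service
        fi
        return $OCF_SUCCESS
    }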
[snipped]
> >>>> -- If a resource is being recovered, but the fail-count is being cleared
> >>>> in the same transition, the cluster will ignore migration-threshold (and
> >>>> the variable will not be set). The RA might see recovery_left=5, 4, 3,
> >>>> then someone clears the fail-count, and it won't see recovery_left even
> >>>> though there is a stop and start being attempted.
> >
> > Hmm. So how would the RA distinguish that case from the one where
> > the stop is final?
>
> That's the main question in all this. There are quite a few scenarios
> where there's no meaningful distinction between 0 and unset. With the
> current implementation at least, the ideal approach is for the RA to
> treat the last stop before a restart the same as a final stop.
OK ...
[snipped]
> > So IIUC, you are talking about a scenario like this:
> >
> > 1. The whole group starts fine.
> > 2. Some time later, the neutron openvswitch agent crashes.
> > 3. Pacemaker shuts down nova-compute since it depends upon
> > the neutron agent due to being later in the same group.
> > 4. Pacemaker repeatedly tries to start the neutron agent,
> > but reaches migration-threshold.
> >
> > At this point, nova-compute is permanently down, but its RA never got
> > passed OCF_RESKEY_CRM_meta_retries_left with a value of 0 or unset,
> > so it never knew to do a nova service-disable.
>
> Basically right, but it would be unset (not empty -- it's never empty).
>
> However, this is a solvable issue. If it's important, I can add the
> variable to all siblings of the failed resource if the entire group
> would be forced away.
Good to hear.
> > (BTW, in this scenario, the group is actually cloned, so no migration
> > to another compute node happens.)
>
> Clones are the perfect example of the lack of distinction between 0 and
> unset. For an anonymous clone running on all nodes, the countdown will
> be 3,2,1,unset because the specific clone instance doesn't need to be
> started anywhere else (it looks more like a final stop of that
> instance). But for unique clones, or anonymous clones where another node
> is available to run the instance, it might be 0.
I see, thanks.
> > Did I get that right? If so, yes it does sound like an issue. Maybe
> > it is possible to avoid this problem by avoiding the use of groups,
> > and instead just use interleaved clones with ordering constraints
> > between them?
>
> That's not any better, and in fact it would be more difficult to add the
> variable to the dependent resource in such a situation, compared to a group.
>
> Generally, only the failed resource will get the variable, not resources
> that may be stopped and started because they depend on the failed
> resource in some way.
OK. So that might be more of a problem for you guys than for us, since
we use cloned groups and you don't:
https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/high-availability-for-compute-instances/chapter-1-use-high-availability-to-protect-instances
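For reference, by "cloned groups" I mean roughly this kind of layout
(crmsh syntax; the resource and agent names here are illustrative rather
than our exact config):

    primitive p_neutron-ovs-agent ocf:openstack:NeutronOVSAgent \
        op monitor interval=30s
    primitive p_nova-compute ocf:openstack:NovaCompute \
        op monitor interval=30s
    group g_compute p_neutron-ovs-agent p_nova-compute
    clone cl_compute g_compute meta interleave=true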
> >> More generally, I suppose the point is to better support services that
> >> can do a lesser tear-down for a stop-start cycle than a full stop. The
> >> distinction between the two cases may not be 100% clear (as with your
> >> fencing example), but the idea is that it would be used for
> >> optimization, not some required behavior.
> >
> > This discussion is prompting me to get this clearer in my head, which
> > is good :-)
> >
> > I suppose we *could* simply modify the existing NovaCompute OCF RA so
> > that every time it executes the 'stop' action, it immediately sends
> > the service-disable message to nova-api, and similarly send
> > service-enable during the 'start' action. However this probably has a
> > few downsides:
> >
> > 1. It could cause rapid flapping of the service state server-side (at
> > least disable followed quickly by enable, or more if it took
> > multiple retries to successfully restart nova-compute), and extra
> > associated noise/load on nova-api and the MQ and DB.
> > 2. It would slow down recovery.
>
> If the start can always send service-enable regardless of whether
> service-disable was previously sent, without much performance penalty,
> then that's a good use case for this. The stop could send
> service-disable when the variable is 0 or unset; the gain would be in
> not having to send service-disable when the variable is >=1.
Right. I'm not sure I like the idea of always sending service-enable
regardless, even though I was the one to air that possibility. It
would risk overriding a nova service-disable invoked manually by a
cloud operator for other reasons. One way around this might be to
locally cache the expected disable/enable state to a file, and to
only invoke service-enable when service-disable was previously invoked
by the same RA.
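Concretely, I'm picturing something like this in the RA -- just a sketch:
the state file path, the helper names and the exact nova CLI syntax are
illustrative, and it assumes the variable ends up as restarts_remaining:

    # Path is illustrative; HA_RSCTMP comes from ocf-shellfuncs.
    SERVICE_DISABLED_FLAG="${HA_RSCTMP}/NovaCompute.service-disabled"

    nova_compute_stop() {
        stop_nova_compute_daemon   # placeholder for the real stop logic

        if [ "${OCF_RESKEY_CRM_meta_restarts_remaining:-0}" -eq 0 ]; then
            # Final stop (or last retry): stop nova-api scheduling work
            # here, and remember that it was us who disabled it.
            nova service-disable "$(hostname)" nova-compute \
                && touch "$SERVICE_DISABLED_FLAG"
        fi
    }

    nova_compute_start() {
        start_nova_compute_daemon  # placeholder for the real start logic

        # Only undo a disable that this RA itself performed, so we never
        # override an operator's manual "nova service-disable".
        if [ -f "$SERVICE_DISABLED_FLAG" ]; then
            nova service-enable "$(hostname)" nova-compute \
                && rm -f "$SERVICE_DISABLED_FLAG"
        fi
    }

One wrinkle: if the flag file lives on tmpfs it won't survive a reboot or
fence, so a persistent location might be preferable.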
> > 3. What happens if whatever is causing nova-compute to fail is also
> > causing nova-api to be unreachable from this compute node?
>
> This is not really addressable by the local node. I think in such a
> situation, fencing will likely be invoked, and it can be addressed then.
Good point.
> > So as you say, the intended optimization here is to make the
> > stop-start cycle faster and more lightweight than the final stop.
> >
> >> I am not sure the current implementation described above is sufficient,
> >> but it should be a good starting point to work from.
> >
> > Hopefully, but you've raised more questions in my head :-)
> >
> > For example, I think there are probably other use cases, e.g.
> >
> > - Take configurable action after failure to restart libvirtd
> > (one possible action is fencing the node; another is to
> > notify the cloud operator)
>
> Just musing a bit ... on-fail + migration-threshold could have been
> designed to be more flexible:
>
> hard-fail-threshold: When an operation fails this many times, the
> cluster will consider the failure to be a "hard" failure. Until this
> many failures, the cluster will try to recover the resource on the same
> node.
How is this different to migration-threshold, other than in name?
> hard-fail-action: What to do when the operation reaches
> hard-fail-threshold ("ban" would work like current "restart" i.e. move
> to another node, and ignore/block/stop/standby/fence would work the same
> as now)
And I'm not sure I understand how this is different to / more flexible
than what we can do with on-fail now?
> That would allow fence etc. to be done only after a specified number of
> retries. Ah, hindsight ...
Isn't that possible now, e.g. with migration-threshold=3 and
on-fail=fence? I feel like I'm missing something.
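i.e. I would have expected something like this (crmsh syntax, purely
illustrative) to already cover the "fence after N failed recovery
attempts" case:

    primitive p_libvirtd systemd:libvirtd \
        op monitor interval=30s on-fail=fence \
        meta migration-threshold=3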
[snipped]
> > - neutron-l3-agent RA detects that the agent is unhealthy, and iff it
> > fails to restart it, we want to trigger migration of any routers on
> > that l3-agent to a healthy l3-agent. Currently we wait for the
> > connection between the agent and the neutron server to time out,
> > which is unpleasantly slow. This case is more of a requirement than
> > an optimization, because we really don't want to migrate routers to
> > another node unless we have to, since a) it takes time, and b) it is
> > disruptive enough that we don't want to have to migrate them back
> > soon after if we discover we can successfully recover the unhealthy
> > l3-agent.
> >
> > - Remove a failed backend from an haproxy-fronted service if
> > it can't be restarted.
> >
> > - Notify any other service (OpenStack or otherwise) where the failing
> > local resource is a backend worker for some central service. I
> > guess ceilometer, cinder, mistral etc. are all potential
> > examples of this.
Any thoughts on the sanity of these?
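For the l3-agent case, for example, the stop action could do something
roughly like this. Very much a sketch: local_l3_agent_id and
stop_l3_agent_daemon are placeholders, restarts_remaining is the name
proposed above, and the neutron CLI invocations are from memory, so treat
them as illustrative too:

    evacuate_local_routers() {
        # local_l3_agent_id is a placeholder for however we look up the
        # UUID of the L3 agent running on this node.
        local agent_id
        agent_id=$(local_l3_agent_id) || return 1

        # Removing the routers from the failed agent should let neutron
        # reschedule them onto a healthy agent immediately, rather than
        # waiting for the agent heartbeat to time out.
        for router in $(neutron router-list-on-l3-agent "$agent_id" \
                            -f value -c id); do
            neutron l3-agent-router-remove "$agent_id" "$router"
        done
    }

    l3_agent_stop() {
        stop_l3_agent_daemon   # placeholder for the real stop logic

        if [ "${OCF_RESKEY_CRM_meta_restarts_remaining:-0}" -eq 0 ]; then
            evacuate_local_routers
        fi
    }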
> > Finally, there's the fundamental question of whether responsibility
> > for monitoring and cleaning up after failures should lie with Pacemaker
> > and OCF RAs, or whether sometimes a central service should handle that
> > itself. For example we could tune the nova / neutron
> > agent timeouts to be much more aggressive, and then those servers
> > would notice agent failures themselves quick enough that we wouldn't
> > have to configure Pacemaker to detect them and then notify the
> > servers.
> >
> > I'm not sure if there is any good reason why Pacemaker can more
> > reliably detect failures than those native keepalive mechanisms. The
> > main difference appears to be that Pacemaker executes monitoring
> > directly on the monitored node via lrmd, and then relays the results
> > back via corosync, whereas server/agent heartbeating typically relies
> > on the state of a simple TCP connection. In that sense, Pacemaker is
> > more flexible in what it can monitor, and the monitoring may also take
> > place over different networks depending on the configuration. And of
> > course it can do fencing when this is required. But in the cases
> > where more sophisticated monitoring and fencing are not required,
> > I wonder if this is worth the added complexity. Thoughts?
>
> Pacemaker also adds rich dependencies that can take into account far
> more information than the central service will know -- constraints,
> utilization attributes, health attributes, rules.
True. But this is mainly of benefit when the clean-up involves doing
things to other services, and in cases such as neutron-l3-agent, I
suspect it tends not to.