[ClusterLabs] Antw: Re: FR: send failcount to OCF RA start/stop actions
Adam Spiers
aspiers at suse.com
Fri May 20 17:40:58 CEST 2016
Ken Gaillot <kgaillot at redhat.com> wrote:
> On 05/12/2016 06:21 AM, Adam Spiers wrote:
> > Ken Gaillot <kgaillot at redhat.com> wrote:
> >> On 05/10/2016 02:29 AM, Ulrich Windl wrote:
> >>>> Here is what I'm testing currently:
> >>>>
> >>>> - When the cluster recovers a resource, the resource agent's stop action
> >>>> will get a new variable, OCF_RESKEY_CRM_meta_recovery_left =
> >>>> migration-threshold - fail-count on the local node.
[snipped]
> > I'd prefer plural (OCF_RESKEY_CRM_meta_recoveries_left) but other than
> > that I think it's good. OCF_RESKEY_CRM_meta_retries_left is shorter;
> > not sure whether it's marginally worse or better though.
>
> I'm now leaning to restart_remaining (restarts_remaining would be just
> as good).
restarts_remaining would be better IMHO, since there will often be
multiple restarts remaining.
[snipped]
> > OK, so the RA code would typically be something like this?
> >
> >     if [ ${OCF_RESKEY_CRM_meta_retries_left:-0} = 0 ]; then
> >         # This is the final stop, so tell the external service
> >         # not to send any more work our way.
> >         disable_service
> >     fi
>
> I'd use -eq :) but yes
Right, -eq is better style for numeric comparison :-)
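For the archive, here is a minimal sketch of how that check might look
with -eq. The variable name is only the one currently being proposed in
this thread and may change, and disable_service is a hypothetical
stand-in for whatever call tells the external service to stop sending
work. An unset variable is treated the same as 0.

```shell
#!/bin/sh
# Sketch only: the variable name below is the one proposed in this
# thread and may not match what Pacemaker finally implements.

# Hypothetical helper: a real RA would call the external service here.
disable_service() {
    echo "service disabled"
}

# Succeeds (returns 0) when no restarts remain; an unset or empty
# variable is treated the same as 0.
is_final_stop() {
    [ "${OCF_RESKEY_CRM_meta_restarts_remaining:-0}" -eq 0 ]
}

ra_stop() {
    if is_final_stop; then
        disable_service
    fi
    # ... the normal stop logic would follow here ...
}
```

Wrapping the test in a small function keeps the "0 or unset" rule in
one place if the variable name or semantics change later.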
[snipped]
> >>>> -- If a resource is being recovered, but the fail-count is being cleared
> >>>> in the same transition, the cluster will ignore migration-threshold (and
> >>>> the variable will not be set). The RA might see recovery_left=5, 4, 3,
> >>>> then someone clears the fail-count, and it won't see recovery_left even
> >>>> though there is a stop and start being attempted.
> >
> > Hmm. So how would the RA distinguish that case from the one where
> > the stop is final?
>
> That's the main question in all this. There are quite a few scenarios
> where there's no meaningful distinction between 0 and unset. With the
> current implementation at least, the ideal approach is for the RA to
> treat the last stop before a restart the same as a final stop.
OK ...
[snipped]
> > So IIUC, you are talking about a scenario like this:
> >
> > 1. The whole group starts fine.
> > 2. Some time later, the neutron openvswitch agent crashes.
> > 3. Pacemaker shuts down nova-compute since it depends upon
> > the neutron agent due to being later in the same group.
> > 4. Pacemaker repeatedly tries to start the neutron agent,
> > but reaches migration-threshold.
> >
> > At this point, nova-compute is permanently down, but its RA never got
> > passed OCF_RESKEY_CRM_meta_retries_left with a value of 0 or unset,
> > so it never knew to do a nova service-disable.
>
> Basically right, but it would be unset (not empty -- it's never empty).
>
> However, this is a solvable issue. If it's important, I can add the
> variable to all siblings of the failed resource if the entire group
> would be forced away.
Good to hear.
> > (BTW, in this scenario, the group is actually cloned, so no migration
> > to another compute node happens.)
>
> Clones are the perfect example of the lack of distinction between 0 and
> unset. For an anonymous clone running on all nodes, the countdown will
> be 3,2,1,unset because the specific clone instance doesn't need to be
> started anywhere else (it looks more like a final stop of that
> instance). But for unique clones, or anonymous clones where another node
> is available to run the instance, it might be 0.
I see, thanks.
> > Did I get that right? If so, yes it does sound like an issue. Maybe
> > it is possible to avoid this problem by avoiding the use of groups,
> > and instead just use interleaved clones with ordering constraints
> > between them?
>
> That's not any better, and in fact it would be more difficult to add the
> variable to the dependent resource in such a situation, compared to a group.
>
> Generally, only the failed resource will get the variable, not resources
> that may be stopped and started because they depend on the failed
> resource in some way.
OK. So that might be more of a problem for you guys than for us, since
we use cloned groups, and you don't:
https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/high-availability-for-compute-instances/chapter-1-use-high-availability-to-protect-instances
> >> More generally, I suppose the point is to better support services that
> >> can do a lesser tear-down for a stop-start cycle than a full stop. The
> >> distinction between the two cases may not be 100% clear (as with your
> >> fencing example), but the idea is that it would be used for
> >> optimization, not some required behavior.
> >
> > This discussion is prompting me to get this clearer in my head, which
> > is good :-)
> >
> > I suppose we *could* simply modify the existing NovaCompute OCF RA so
> > that every time it executes the 'stop' action, it immediately sends
> > the service-disable message to nova-api, and similarly send
> > service-enable during the 'start' action. However this probably has a
> > few downsides:
> >
> > 1. It could cause rapid flapping of the service state server-side (at
> > least disable followed quickly by enable, or more if it took
> > multiple retries to successfully restart nova-compute), and extra
> > associated noise/load on nova-api and the MQ and DB.
> > 2. It would slow down recovery.
>
> If the start can always send service-enable regardless of whether
> service-disable was previously sent, without much performance penalty,
> then that's a good use case for this. The stop could send
> service-disable when the variable is 0 or unset; the gain would be in
> not having to send service-disable when the variable is >=1.
Right. I'm not sure I like the idea of always sending service-enable
regardless, even though I was the one to air that possibility. It
would risk overriding a nova service-disable invoked manually by a
cloud operator for other reasons. One way around this might be to
locally cache the expected disable/enable state to a file, and to
only invoke service-enable when service-disable was previously invoked
by the same RA.
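A rough sketch of that caching idea, purely as an illustration: the
state file path and the two nova_service_* helpers are hypothetical
placeholders, not existing RA functions, and a real RA would need to
pick a location with the right lifetime across reboots.

```shell
#!/bin/sh
# Sketch only: the state file path and both helpers are hypothetical.
STATE_FILE="${STATE_FILE:-${TMPDIR:-/tmp}/nova-compute.disabled-by-ra}"

# Hypothetical stand-ins for the real calls to nova-api.
nova_service_disable() { echo "disable sent"; }
nova_service_enable()  { echo "enable sent"; }

# Called on a final stop: disable the service and record that this RA
# was the one that did it.
ra_disable() {
    nova_service_disable && : > "$STATE_FILE"
}

# Called on start: re-enable only if this RA did the disabling, so a
# manual disable by a cloud operator is left alone.
ra_enable() {
    if [ -e "$STATE_FILE" ]; then
        nova_service_enable && rm -f "$STATE_FILE"
    fi
}
```

The marker file is only written after the disable call succeeds, and
only removed after the enable call succeeds, so a failed call leaves
the recorded state unchanged.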
> > 3. What happens if whatever is causing nova-compute to fail is also
> > causing nova-api to be unreachable from this compute node?
>
> This is not really addressable by the local node. I think in such a
> situation, fencing will likely be invoked, and it can be addressed then.
Good point.
> > So as you say, the intended optimization here is to make the
> > stop-start cycle faster and more lightweight than the final stop.
> >
> >> I am not sure the current implementation described above is sufficient,
> >> but it should be a good starting point to work from.
> >
> > Hopefully, but you've raised more questions in my head :-)
> >
> > For example, I think there are probably other use cases, e.g.
> >
> > - Take configurable action after failure to restart libvirtd
> > (one possible action is fencing the node; another is to
> > notify the cloud operator)
>
> Just musing a bit ... on-fail + migration-threshold could have been
> designed to be more flexible:
>
> hard-fail-threshold: When an operation fails this many times, the
> cluster will consider the failure to be a "hard" failure. Until this
> many failures, the cluster will try to recover the resource on the same
> node.
How is this different to migration-threshold, other than in name?
> hard-fail-action: What to do when the operation reaches
> hard-fail-threshold ("ban" would work like current "restart" i.e. move
> to another node, and ignore/block/stop/standby/fence would work the same
> as now)
And I'm not sure I understand how this is different to / more flexible
than what we can do with on-fail now?
> That would allow fence etc. to be done only after a specified number of
> retries. Ah, hindsight ...
Isn't that possible now, e.g. with migration-threshold=3 and
on-fail=fence? I feel like I'm missing something.
[snipped]
> > - neutron-l3-agent RA detects that the agent is unhealthy, and iff it
> > fails to restart it, we want to trigger migration of any routers on
> > that l3-agent to a healthy l3-agent. Currently we wait for the
> > connection between the agent and the neutron server to time out,
> > which is unpleasantly slow. This case is more of a requirement than
> > an optimization, because we really don't want to migrate routers to
> > another node unless we have to, because a) it takes time, and b) is
> > disruptive enough that we don't want to have to migrate them back
> > soon after if we discover we can successfully recover the unhealthy
> > l3-agent.
> >
> > - Remove a failed backend from an haproxy-fronted service if
> > it can't be restarted.
> >
> > - Notify any other service (OpenStack or otherwise) where the failing
> > local resource is a backend worker for some central service. I
> > guess ceilometer, cinder, mistral etc. are all potential
> > examples of this.
Any thoughts on the sanity of these?
> > Finally, there's the fundamental question of whether responsibility
> > for monitoring and cleaning up after failures should lie with
> > Pacemaker and OCF RAs, or whether sometimes a central service should
> > handle that itself. For example we could tune the nova / neutron
> > agent timeouts to be much more aggressive, and then those servers
> > would notice agent failures themselves quickly enough that we wouldn't
> > have to configure Pacemaker to detect them and then notify the
> > servers.
> >
> > I'm not sure if there is any good reason why Pacemaker can more
> > reliably detect failures than those native keepalive mechanisms. The
> > main difference appears to be that Pacemaker executes monitoring
> > directly on the monitored node via lrmd, and then relays the results
> > back via corosync, whereas server/agent heartbeating typically relies
> > on the state of a simple TCP connection. In that sense, Pacemaker is
> > more flexible in what it can monitor, and the monitoring may also take
> > place over different networks depending on the configuration. And of
> > course it can do fencing when this is required. But in the cases
> > where more sophisticated monitoring and fencing are not required,
> > I wonder if this is worth the added complexity. Thoughts?
>
> Pacemaker also adds rich dependencies that can take into account far
> more information than the central service will know -- constraints,
> utilization attributes, health attributes, rules.
True. But this is mainly of benefit when the clean-up involves doing
things to other services, and in cases such as neutron-l3-agent, I
suspect it tends not to.