[ClusterLabs] Antw: Re: FR: send failcount to OCF RA start/stop actions

Ken Gaillot kgaillot at redhat.com
Thu May 12 12:05:16 EDT 2016


On 05/12/2016 06:21 AM, Adam Spiers wrote:
> Hi Ken,
> 
> Firstly thanks a lot not just for working on this, but also for being
> so proactive in discussing the details.  A perfect example of
> OpenStack's "Open Design" philosophy in action :-)
> 
> Ken Gaillot <kgaillot at redhat.com> wrote:
>> On 05/10/2016 02:29 AM, Ulrich Windl wrote:
>>>>>> Ken Gaillot <kgaillot at redhat.com> wrote on 10.05.2016 at 00:40 in message
>>> <573111D3.7060102 at redhat.com>:
> 
> [snipped]
> 
>>>> Here is what I'm testing currently:
>>>>
>>>> - When the cluster recovers a resource, the resource agent's stop action
>>>> will get a new variable, OCF_RESKEY_CRM_meta_recovery_left =
>>>> migration-threshold - fail-count on the local node.
>>>
>>> With that mechanism, RA testing will be more complicated than it is
>>> now, and I cannot see the benefit yet.
>>
>> Testing will be more complicated for RAs that choose to behave
>> differently depending on the variable value, but the vast, vast majority
>> won't, so it will have no effect on most users. No pacemaker behavior
>> changes.
>>
>> BTW I should have explicitly mentioned that the variable name is up for
>> discussion; I had a hard time coming up with something meaningful that
>> didn't span an entire line of text.
> 
> I'd prefer plural (OCF_RESKEY_CRM_meta_recoveries_left) but other than
> that I think it's good.  OCF_RESKEY_CRM_meta_retries_left is shorter;
> not sure whether it's marginally worse or better though.

I'm now leaning toward restart_remaining (restarts_remaining would be
just as good).

>>>> - The variable is not added for any action other than stop.
>>>>
>>>> - I'm preferring simplicity over flexibility by providing only a single
>>>> variable. The RA theoretically can already get the migration-threshold
>>>> from the CIB and fail-count from attrd -- what we're adding is the
>>>> knowledge that the stop is part of a recovery.
>>>>
>>>> - If the stop is final (the cluster does not plan to start the resource
>>>> anywhere), the variable may be set to 0, or unset. The RA should treat 0
>>>> and unset as equivalent.
>>>>
>>>> - So, the variable will be 1 for the stop before the last time the
>>>> cluster will try to start the resource on the same node, and 0 or unset
>>>> for the last stop on this node before trying to start on another node.
> 
> OK, so the RA code would typically be something like this?
> 
>     if [ ${OCF_RESKEY_CRM_meta_retries_left:-0} = 0 ]; then
>         # This is the final stop, so tell the external service
>         # not to send any more work our way.
>         disable_service
>     fi

I'd use -eq :) but yes
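
That is, something along these lines (keeping your retries_left name and
the disable_service placeholder for now):

    if [ "${OCF_RESKEY_CRM_meta_retries_left:-0}" -eq 0 ]; then
        disable_service   # same hypothetical helper as in your sketch
    fi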

>>> Be aware that the node could be fenced (for reasons outside of your
>>> RA) even before all these attempts are carried out.
>>
>> Yes, by listing such scenarios and the ones below, I am hoping the
>> potential users of this feature can think through whether it will be
>> sufficient for their use cases.
> 
> That's a good point, but I think it's OK because if the node gets
> fenced, we have one and shortly two different mechanisms for achieving
> the same thing:
> 
>   1. add another custom fencing agent to fencing_topology
>   2. use the new events mechanism
> 
>>>> - The variable will be set only in situations when the cluster will
>>>> consider migration-threshold. This makes sense, but some situations may
>>>> be unintuitive:
>>>>
>>>> -- If a resource is being recovered, but the fail-count is being cleared
>>>> in the same transition, the cluster will ignore migration-threshold (and
>>>> the variable will not be set). The RA might see recovery_left=5, 4, 3,
>>>> then someone clears the fail-count, and it won't see recovery_left even
>>>> though there is a stop and start being attempted.
> 
> Hmm.  So how would the RA distinguish that case from the one where
> the stop is final?

That's the main question in all this. There are quite a few scenarios
where there's no meaningful distinction between 0 and unset. With the
current implementation at least, the ideal approach is for the RA to
treat the last stop before a restart the same as a final stop.
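
(For reference, the clearing in that example is typically an administrator
running something like the commands below; the resource and node names are
placeholders, and the exact options vary a bit by Pacemaker version.)

    # Wipe the resource's failure history on one node
    crm_resource --cleanup --resource my-rsc --node node1

    # Or remove just the fail-count attribute
    crm_failcount -r my-rsc -N node1 -D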

>>>> -- Migration-threshold will be considered (and the variable will be set)
>>>> only if the resource is being recovered due to failure, not if the
>>>> resource is being restarted or moved for some other reason (constraints,
>>>> node standby, etc.).
>>>>
>>>> -- The previous point is true even if the resource is restarting/moving
>>>> because it is part of a group with another member being recovered due to
>>>> failure. Only the failed resource will get the variable. I can see this
>>>> might be problematic for interested RAs, because the resource may be
>>>> restarted several times on the local node then forced away, without the
>>>> variable ever being present -- but the resource will be forced away
>>>> because it is part of a group that is moving, not because it is being
>>>> recovered (its own fail-count stays 0).
> 
> This is a valid concern for the use case in question which I'm quoting
> immediately here for the benefit of those outside the recent
> discussions at the OpenStack summit in Austin:
> 
>>> Can you summarize in one sentence what problem your proposal will solve?
>>
>> While it may be useful to others in the future, the one use case it is
>> intended to address at the moment is:
>>
>> The resource agent for OpenStack compute nodes can disable nova on the
>> local node if the cluster will not try to restart the agent there.
> 
> In this use case, we (SUSE) do indeed place this within a group which
> also includes libvirtd and the neutron openvswitch agent.
> 
> Actually in Austin, Sampath helped me realise that libvirtd should not
> be a strict prerequisite for nova-compute, since nova-compute is
> already able to gracefully handle libvirtd dying and then coming back,
> and in that scenario it is more helpful to keep nova-compute running
> so that nova-server remains apprised of the health of that particular
> compute node.
> 
> Possibly something similar is also true regarding the neutron
> openvswitch agent, but let's assume it's not, in case that causes an
> issue here :-)
> 
> So IIUC, you are talking about a scenario like this:
> 
> 1. The whole group starts fine.
> 2. Some time later, the neutron openvswitch agent crashes.
> 3. Pacemaker shuts down nova-compute since it depends upon
>    the neutron agent due to being later in the same group.
> 4. Pacemaker repeatedly tries to start the neutron agent,
>    but reaches migration-threshold.
> 
> At this point, nova-compute is permanently down, but its RA never got
> passed OCF_RESKEY_CRM_meta_retries_left with a value of 0 or unset,
> so it never knew to do a nova service-disable.

Basically right, but it would be unset (not empty -- it's never empty).

However, this is a solvable issue. If it's important, I can add the
variable to all siblings of the failed resource if the entire group
would be forced away.
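
For readers without a configuration in front of them, the setup being
discussed is roughly a cloned group along these lines (pcs syntax; the
agent and resource names here are purely illustrative, not the actual
SUSE configuration):

    pcs resource create neutron-ovs-agent ocf:openstack:NeutronOVSAgent \
        op monitor interval=30s
    pcs resource create libvirtd systemd:libvirtd \
        op monitor interval=30s
    pcs resource create nova-compute ocf:openstack:NovaCompute \
        op monitor interval=30s meta migration-threshold=3
    pcs resource group add g-compute neutron-ovs-agent libvirtd nova-compute
    pcs resource clone g-compute interleave=true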

> (BTW, in this scenario, the group is actually cloned, so no migration
> to another compute node happens.)

Clones are the perfect example of the lack of distinction between 0 and
unset. For an anonymous clone running on all nodes, the countdown will
be 3,2,1,unset because the specific clone instance doesn't need to be
started anywhere else (it looks more like a final stop of that
instance). But for unique clones, or anonymous clones where another node
is available to run the instance, it might be 0.

> Did I get that right?  If so, yes it does sound like an issue.  Maybe
> it is possible to avoid this problem by avoiding the use of groups,
> and instead just use interleaved clones with ordering constraints
> between them?

That's not any better, and in fact it would be more difficult to add the
variable to the dependent resource in such a situation, compared to a group.

Generally, only the failed resource will get the variable, not resources
that may be stopped and started because they depend on the failed
resource in some way.

>> More generally, I suppose the point is to better support services that
>> can do a lesser tear-down for a stop-start cycle than for a full stop. The
>> distinction between the two cases may not be 100% clear (as with your
>> fencing example), but the idea is that it would be used for
>> optimization, not some required behavior.
> 
> This discussion is prompting me to get this clearer in my head, which
> is good :-)
> 
> I suppose we *could* simply modify the existing NovaCompute OCF RA so
> that every time it executes the 'stop' action, it immediately sends
> the service-disable message to nova-api, and similarly send
> service-enable during the 'start' action.  However this probably has a
> few downsides:
> 
> 1. It could cause rapid flapping of the service state server-side (at
>    least disable followed quickly by enable, or more if it took
>    multiple retries to successfully restart nova-compute), and extra
>    associated noise/load on nova-api and the MQ and DB.
> 2. It would slow down recovery.

If the start can always send service-enable regardless of whether
service-disable was previously sent, without much performance penalty,
then that's a good use case for this. The stop could send
service-disable when the variable is 0 or unset; the gain would be in
not having to send service-disable when the variable is >=1.
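
A sketch of how that could look in the RA's actions (assuming the usual
ocf-shellfuncs are sourced; the *_process helpers are placeholders, the
variable name is still open, and the exact nova CLI arguments depend on
the client version):

    nova_compute_stop() {
        stop_nova_compute_process      # placeholder for the real stop logic
        # Only tell nova-api to stop scheduling work here when no further
        # local restart is expected (variable unset or 0)
        if [ "${OCF_RESKEY_CRM_meta_restarts_remaining:-0}" -eq 0 ]; then
            nova service-disable "$(hostname)" nova-compute
        fi
        return $OCF_SUCCESS
    }

    nova_compute_start() {
        # Re-enable unconditionally: it is cheap, idempotent, and correct
        # whether or not the previous stop sent service-disable
        nova service-enable "$(hostname)" nova-compute
        start_nova_compute_process     # placeholder for the real start logic
        return $OCF_SUCCESS
    }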

> 3. What happens if whatever is causing nova-compute to fail is also
>    causing nova-api to be unreachable from this compute node?

This is not really addressable by the local node. I think in such a
situation, fencing will likely be invoked, and it can be addressed then.

> So as you say, the intended optimization here is to make the
> stop-start cycle faster and more lightweight than the final stop.
> 
>> I am not sure the current implementation described above is sufficient,
>> but it should be a good starting point to work from.
> 
> Hopefully, but you've raised more questions in my head :-)
> 
> For example, I think there are probably other use cases, e.g.
> 
> - Take configurable action after failure to restart libvirtd
>   (one possible action is fencing the node; another is to
>   notify the cloud operator)

Just musing a bit ... on-fail + migration-threshold could have been
designed to be more flexible:

  hard-fail-threshold: When an operation fails this many times, the
  cluster will consider the failure to be a "hard" failure. Until that
  many failures, the cluster will try to recover the resource on the
  same node.

  hard-fail-action: What to do when the operation reaches
  hard-fail-threshold ("ban" would work like the current "restart",
  i.e. move to another node; ignore/block/stop/standby/fence would
  work the same as they do now)

That would allow fence etc. to be done only after a specified number of
retries. Ah, hindsight ...

But yes, this new variable could be used to take some extra action on
the "final" stop.
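
For the libvirtd example, that extra action could be as small as this at
the end of the stop action (notify_operator standing in for whatever
alerting hook a deployment already has):

    if [ "${OCF_RESKEY_CRM_meta_restarts_remaining:-0}" -eq 0 ]; then
        # Out of local restart attempts -- escalate rather than fail quietly
        logger -t libvirtd-RA "libvirtd could not be recovered locally"
        notify_operator "libvirtd on $(hostname) needs attention"
    fi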

> - neutron-l3-agent RA detects that the agent is unhealthy, and iff it
>   fails to restart it, we want to trigger migration of any routers on
>   that l3-agent to a healthy l3-agent.  Currently we wait for the
>   connection between the agent and the neutron server to time out,
>   which is unpleasantly slow.  This case is more of a requirement than
>   an optimization, because we really don't want to migrate routers to
>   another node unless we have to, because a) it takes time, and b) is
>   disruptive enough that we don't want to have to migrate them back
>   soon after if we discover we can successfully recover the unhealthy
>   l3-agent.
> 
> - Remove a failed backend from an haproxy-fronted service if
>   it can't be restarted.
> 
> - Notify any other service (OpenStack or otherwise) where the failing
>   local resource is a backend worker for some central service.  I
>   guess ceilometer, cinder, mistral etc. are all potential
>   examples of this.
> 
> Finally, there's the fundamental question of whether monitoring and
> cleaning up after failures should be handled by Pacemaker and OCF
> RAs, or whether sometimes a central service should handle that
> itself.  For example, we could tune the nova / neutron
> agent timeouts to be much more aggressive, and then those servers
> would notice agent failures themselves quick enough that we wouldn't
> have to configure Pacemaker to detect them and then notify the
> servers.
> 
> I'm not sure if there is any good reason why Pacemaker can more
> reliably detect failures than those native keepalive mechanisms.  The
> main difference appears to be that Pacemaker executes monitoring
> directly on the monitored node via lrmd, and then relays the results
> back via corosync, whereas server/agent heartbeating typically relies
> on the state of a simple TCP connection.  In that sense, Pacemaker is
> more flexible in what it can monitor, and the monitoring may also take
> place over different networks depending on the configuration.  And of
> course it can do fencing when this is required.  But in the cases
> where more sophisticated monitoring and fencing are not required,
> I wonder if this is worth the added complexity.  Thoughts?

Pacemaker also adds rich dependencies that can take into account far
more information than the central service will know -- constraints,
utilization attributes, health attributes, rules.

> Thanks!
> Adam
> 




