[ClusterLabs] Antw: Re: FR: send failcount to OCF RA start/stop actions

Adam Spiers aspiers at suse.com
Thu May 12 07:21:35 EDT 2016


Hi Ken,

Firstly thanks a lot not just for working on this, but also for being
so proactive in discussing the details.  A perfect example of
OpenStack's "Open Design" philosophy in action :-)

Ken Gaillot <kgaillot at redhat.com> wrote:
> On 05/10/2016 02:29 AM, Ulrich Windl wrote:
> >>>> Ken Gaillot <kgaillot at redhat.com> wrote on 10.05.2016 at 00:40 in message
> > <573111D3.7060102 at redhat.com>:

[snipped]

> >> Here is what I'm testing currently:
> >>
> >> - When the cluster recovers a resource, the resource agent's stop action
> >> will get a new variable, OCF_RESKEY_CRM_meta_recovery_left =
> >> migration-threshold - fail-count on the local node.
> > 
> > With that mechanism RA testing will be more complicated than it is
> > now, and I cannot see the benefit yet.
> 
> Testing will be more complicated for RAs that choose to behave
> differently depending on the variable value, but the vast, vast majority
> won't, so it will have no effect on most users. No pacemaker behavior
> changes.
> 
> BTW I should have explicitly mentioned that the variable name is up for
> discussion; I had a hard time coming up with something meaningful that
> didn't span an entire line of text.

I'd prefer plural (OCF_RESKEY_CRM_meta_recoveries_left) but other than
that I think it's good.  OCF_RESKEY_CRM_meta_retries_left is shorter;
not sure whether it's marginally worse or better though.

> >> - The variable is not added for any action other than stop.
> >>
> >> - I'm preferring simplicity over flexibility by providing only a single
> >> variable. The RA theoretically can already get the migration-threshold
> >> from the CIB and fail-count from attrd -- what we're adding is the
> >> knowledge that the stop is part of a recovery.
> >>
> >> - If the stop is final (the cluster does not plan to start the resource
> >> anywhere), the variable may be set to 0, or unset. The RA should treat 0
> >> and unset as equivalent.
> >>
> >> - So, the variable will be 1 for the stop before the last time the
> >> cluster will try to start the resource on the same node, and 0 or unset
> >> for the last stop on this node before trying to start on another node.

OK, so the RA code would typically be something like this?

    if [ "${OCF_RESKEY_CRM_meta_retries_left:-0}" = 0 ]; then
        # This is the final stop, so tell the external service
        # not to send any more work our way.
        disable_service
    fi

> > Be aware that the node could be fenced (for reasons outside of your
> > RA) even before all these attempts are carried out.
> 
> Yes, by listing such scenarios and the ones below, I am hoping the
> potential users of this feature can think through whether it will be
> sufficient for their use cases.

That's a good point, but I think it's OK because if the node gets
fenced, we have one and shortly two different mechanisms for achieving
the same thing:

  1. add another custom fencing agent to fencing_topology
  2. use the new events mechanism

> >> - The variable will be set only in situations when the cluster will
> >> consider migration-threshold. This makes sense, but some situations may
> >> be unintuitive:
> >>
> >> -- If a resource is being recovered, but the fail-count is being cleared
> >> in the same transition, the cluster will ignore migration-threshold (and
> >> the variable will not be set). The RA might see recovery_left=5, 4, 3,
> >> then someone clears the fail-count, and it won't see recovery_left even
> >> though there is a stop and start being attempted.

Hmm.  So how would the RA distinguish that case from the one where
the stop is final?
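
To spell out why that worries me, here's a minimal sketch (using the
same retries_left name and disable_service placeholder as in my
snippet above) of how the stop path would behave; in the
fail-count-cleared case it takes the "final stop" branch even though a
start follows immediately:

    # Both of these look identical to the RA:
    #   (a) a genuinely final stop (no further start attempts here)
    #   (b) a stop during recovery where the fail-count was cleared
    #       in the same transition (variable unset)
    if [ "${OCF_RESKEY_CRM_meta_retries_left:-0}" = 0 ]; then
        # In case (b) this wrongly disables the service just before
        # the cluster starts it again on the same node.
        disable_service
    fi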

> >> -- Migration-threshold will be considered (and the variable will be set)
> >> only if the resource is being recovered due to failure, not if the
> >> resource is being restarted or moved for some other reason (constraints,
> >> node standby, etc.).
> >>
> >> -- The previous point is true even if the resource is restarting/moving
> >> because it is part of a group with another member being recovered due to
> >> failure. Only the failed resource will get the variable. I can see this
> >> might be problematic for interested RAs, because the resource may be
> >> restarted several times on the local node then forced away, without the
> >> variable ever being present -- but the resource will be forced away
> >> because it is part of a group that is moving, not because it is being
> >> recovered (its own fail-count stays 0).

This is a valid concern for the use case in question, which I'm
quoting below for the benefit of those outside the recent discussions
at the OpenStack summit in Austin:

> > Can you summarize in one sentence what problem your proposal will solve?
> 
> While it may be useful to others in the future, the one use case it is
> intended to address at the moment is:
> 
> The resource agent for OpenStack compute nodes can disable nova on the
> local node if the cluster will not try to restart the agent there.

In this use case, we (SUSE) do indeed place this resource within a
group which also includes libvirtd and the neutron openvswitch agent.

Actually in Austin, Sampath helped me realise that libvirtd should not
be a strict prerequisite for nova-compute, since nova-compute is
already able to gracefully handle libvirtd dying and then coming back,
and in that scenario it is more helpful to keep nova-compute running
so that the nova server remains apprised of the health of that
particular compute node.

Possibly something similar is also true regarding the neutron
openvswitch agent, but let's assume it's not, in case that causes an
issue here :-)

So IIUC, you are talking about a scenario like this:

1. The whole group starts fine.
2. Some time later, the neutron openvswitch agent crashes.
3. Pacemaker shuts down nova-compute since it depends upon
   the neutron agent due to being later in the same group.
4. Pacemaker repeatedly tries to start the neutron agent,
   but reaches migration-threshold.

At this point, nova-compute is permanently down, but its RA never got
passed OCF_RESKEY_CRM_meta_retries_left with a value of 0 or unset,
so it never knew to do a nova service-disable.

(BTW, in this scenario, the group is actually cloned, so no migration
to another compute node happens.)

Did I get that right?  If so, yes, it does sound like an issue.  Maybe
it is possible to avoid this problem by avoiding the use of groups,
and instead just using interleaved clones with ordering constraints
between them?
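
For the record, and purely as an untested sketch with made-up resource
names, the sort of configuration I have in mind (via crmsh one-shot
commands, assuming the two primitives already exist) would be roughly:

    # Hypothetical names: p-neutron-ovs-agent and p-nova-compute.
    crm configure clone cl-neutron-ovs-agent p-neutron-ovs-agent \
        meta interleave=true
    crm configure clone cl-nova-compute p-nova-compute \
        meta interleave=true
    # Plain ordering constraint in place of the group's implicit
    # ordering/colocation:
    crm configure order o-neutron-before-nova Mandatory: \
        cl-neutron-ovs-agent cl-nova-compute

That drops the group's implicit colocation, which for clones running
on every compute node may not matter much, but whether it actually
sidesteps the problem described above is exactly the open question.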

> More generally, I suppose the point is to better support services that
> can do a lesser tear-down for a stop-start cycle than a full stop. The
> distinction between the two cases may not be 100% clear (as with your
> fencing example), but the idea is that it would be used for
> optimization, not some required behavior.

This discussion is prompting me to get this clearer in my head, which
is good :-)

I suppose we *could* simply modify the existing NovaCompute OCF RA so
that every time it executes the 'stop' action, it immediately sends
the service-disable message to nova-api, and similarly sends
service-enable during the 'start' action.  However, this probably has
a few downsides:

1. It could cause rapid flapping of the service state server-side (at
   least disable followed quickly by enable, or more if it took
   multiple retries to successfully restart nova-compute), and extra
   associated noise/load on nova-api and the MQ and DB.
2. It would slow down recovery.
3. What happens if whatever is causing nova-compute to fail is also
   causing nova-api to be unreachable from this compute node?

So as you say, the intended optimization here is to make the
stop-start cycle faster and more lightweight than the final stop.
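
In other words, with the proposed variable the stop action could defer
the expensive part to the final stop only.  A rough sketch of that,
assuming the usual OCF shell helpers are sourced and with
stop_nova_compute_process standing in for the RA's existing process
shutdown logic:

    nova_compute_stop() {
        # Normal lightweight teardown happens on every stop.
        stop_nova_compute_process || return $OCF_ERR_GENERIC

        # Only tell nova-api to stop scheduling work here when the
        # cluster will not attempt another local restart.
        if [ "${OCF_RESKEY_CRM_meta_retries_left:-0}" = 0 ]; then
            # Non-fatal: nova-api may be unreachable for the same
            # underlying reason the agent is failing (downside 3 above).
            nova service-disable "$(hostname)" nova-compute \
                || ocf_log warn "failed to disable nova-compute via nova-api"
        fi
        return $OCF_SUCCESS
    }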

> I am not sure the current implementation described above is sufficient,
> but it should be a good starting point to work from.

Hopefully, but you've raised more questions in my head :-)

For example, I think there are probably other use cases, e.g.

- Take configurable action after failure to restart libvirtd
  (one possible action is fencing the node; another is to
  notify the cloud operator)

- neutron-l3-agent RA detects that the agent is unhealthy, and iff it
  fails to restart it, we want to trigger migration of any routers on
  that l3-agent to a healthy l3-agent.  Currently we wait for the
  connection between the agent and the neutron server to time out,
  which is unpleasantly slow.  This case is more of a requirement than
  an optimization, because we really don't want to migrate routers to
  another node unless we have to, because a) it takes time, and b) it
  is disruptive enough that we don't want to have to migrate them back
  soon after if we discover we can successfully recover the unhealthy
  l3-agent.

- Remove a failed backend from an haproxy-fronted service if
  it can't be restarted (see the sketch just after this list).

- Notify any other service (OpenStack or otherwise) where the failing
  local resource is a backend worker for some central service.  I
  guess ceilometer, cinder, mistral etc. are all potential
  examples of this.
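
For the haproxy case, the "final stop" branch might need to do nothing
more than poke the admin socket; a minimal sketch, assuming the stats
socket is enabled at admin level and using made-up backend/server
names:

    if [ "${OCF_RESKEY_CRM_meta_retries_left:-0}" = 0 ]; then
        # be_myservice and the socket path are made up; requires
        # "stats socket /var/run/haproxy.sock level admin" in haproxy.cfg.
        echo "disable server be_myservice/$(hostname)" \
            | socat stdio /var/run/haproxy.sock
    fi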

Finally, there's the fundamental question of whether responsibility
for monitoring and cleaning up after failures should be handled by
Pacemaker and OCF RAs, or whether sometimes a central service should
handle that itself.  For example, we could tune the nova / neutron
agent timeouts to be much more aggressive, and then those servers
would notice agent failures themselves quickly enough that we wouldn't
have to configure Pacemaker to detect them and then notify the
servers.
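
To illustrate what I mean by "more aggressive" (values purely
illustrative, and using crudini only as a compact way to show the
option names, which IIRC are report_interval and service_down_time in
nova.conf's DEFAULT section):

    # How often each nova-compute heartbeats its state, and how long
    # the nova server waits before treating that service as down.
    crudini --set /etc/nova/nova.conf DEFAULT report_interval 5
    crudini --set /etc/nova/nova.conf DEFAULT service_down_time 20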

I'm not sure if there is any good reason why Pacemaker can more
reliably detect failures than those native keepalive mechanisms.  The
main difference appears to be that Pacemaker executes monitoring
directly on the monitored node via lrmd, and then relays the results
back via corosync, whereas server/agent heartbeating typically relies
on the state of a simple TCP connection.  In that sense, Pacemaker is
more flexible in what it can monitor, and the monitoring may also take
place over different networks depending on the configuration.  And of
course it can do fencing when this is required.  But in the cases
where more sophisticated monitoring and fencing are not required,
I wonder if this is worth the added complexity.  Thoughts?

Thanks!
Adam



