[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?
abeekhof at redhat.com
Thu Jun 2 21:01:25 EDT 2016
On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
> A recent thread discussed a proposed new feature, a new environment
> variable that would be passed to resource agents, indicating whether a
> stop action was part of a recovery.
> Since that thread was long and covered a lot of topics, I'm starting a
> new one to focus on the core issue remaining:
> The original idea was to pass the number of restarts remaining before
> the resource would no longer be started on the same node. This
> involves calculating (migration-threshold - fail-count), and that
> implies certain limitations: (1) it will only be set when the cluster
> checks migration-threshold; (2) it will only be set for the failed
> resource itself, not for other resources that may be recovered due to
> dependencies on it.
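For illustration, the count-based idea could be sketched roughly as below. This is a minimal sketch and not Pacemaker code: the function name, its parameters, and the use of None for "not set" are all assumptions made for the example.

```python
from typing import Optional


def restarts_remaining(
    migration_threshold: Optional[int],
    fail_count: int,
    is_failed_resource: bool = True,
) -> Optional[int]:
    """Hypothetical helper: restarts left on this node, or None if unset."""
    # Limitation (1): with no migration-threshold to check, nothing is set.
    if migration_threshold is None:
        return None
    # Limitation (2): only the failed resource itself gets a value, not
    # resources that are restarted because they depend on it.
    if not is_failed_resource:
        return None
    # Clamp at zero so an exceeded threshold never yields a negative count.
    return max(migration_threshold - fail_count, 0)
```

The two early returns correspond to the two limitations listed above: in both cases the agent would see no value at all.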
> Ulrich Windl proposed an alternative: setting a boolean value instead. I
> forgot to cc the list on my reply, so I'll summarize now: We would set a
> new variable like OCF_RESKEY_CRM_recovery=true
This concept worries me, especially when what we've implemented is
The name alone encourages people to "optimise" the agent to not
actually stop the service "because it's just going to start again
shortly". I know that's not what Adam would do, but not everyone
understands how clusters work.
There are any number of reasons why a cluster that intends to restart
a service may not do so. In such a scenario, a badly written agent
would cause the cluster to mistakenly believe that the service is
stopped - allowing it to start elsewhere.
It's true there are any number of ways to write bad agents, but I would
argue that we shouldn't be nudging people in that direction :)
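To make the concern concrete, here is a rough sketch of a stop action that treats the proposed variable purely as a hint. It assumes the variable name OCF_RESKEY_CRM_recovery from the proposal above; stop_service and is_running are hypothetical callables standing in for whatever the agent really does, and only the two return codes are standard OCF values.

```python
import os
from typing import Callable, Mapping

OCF_SUCCESS = 0      # standard OCF return code: action succeeded
OCF_ERR_GENERIC = 1  # standard OCF return code: generic failure


def stop_action(
    stop_service: Callable[[], None],
    is_running: Callable[[], bool],
    env: Mapping[str, str] = os.environ,
) -> int:
    """Hypothetical stop handler: always stops, whatever the hint says."""
    # The proposed variable is read only for logging. It must never be
    # used to skip the stop, since the planned restart may not happen.
    if env.get("OCF_RESKEY_CRM_recovery") == "true":
        print("stop is part of a recovery; stopping fully anyway")
    if is_running():
        stop_service()
    # Report success only once the service is verifiably down.
    return OCF_ERR_GENERIC if is_running() else OCF_SUCCESS
```

An agent written this way reports "stopped" only when the service really is stopped, so the cluster can safely start it elsewhere if the restart never comes.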
> whenever a start is
> scheduled after a stop on the same node in the same transition. This
> would avoid the corner cases of the previous approach; instead of being
> tied to migration-threshold, it would be set whenever a recovery was
> being attempted, for any reason. And with this approach, it should be
> easier to set the variable for all actions on the resource
> (demote/stop/start/promote), rather than just the stop.
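The rule described above (a start scheduled after a stop on the same node in the same transition) might be sketched as follows. Modelling a transition as an ordered list of actions is a simplifying assumption for the example, and the Action type and is_recovery function are hypothetical names, not Pacemaker's internals.

```python
from typing import List, NamedTuple


class Action(NamedTuple):
    """Hypothetical model of one scheduled action in a transition."""
    resource: str
    task: str  # e.g. "stop", "start", "demote", "promote"
    node: str


def is_recovery(transition: List[Action], resource: str, node: str) -> bool:
    """True if a start follows a stop of the resource on the same node."""
    stop_seen = False
    for action in transition:
        # Ignore actions for other resources or other nodes.
        if action.resource != resource or action.node != node:
            continue
        if action.task == "stop":
            stop_seen = True
        elif action.task == "start" and stop_seen:
            return True
    return False
```

Under this model a stop on one node followed by a start on a different node is a move, not a recovery, so the boolean would stay unset.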
> I think the boolean approach fits all the envisioned use cases that have
> been discussed. Any objections to going that route instead of the count?
> Ken Gaillot <kgaillot at redhat.com>