[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Thu May 19 17:53:31 CEST 2016

A recent thread discussed a proposed new feature, a new environment
variable that would be passed to resource agents, indicating whether a
stop action was part of a recovery.

Since that thread was long and covered a lot of topics, I'm starting a
new one to focus on the core issue remaining:

The original idea was to pass the number of restarts remaining before
the cluster stops trying to start the resource on the same node. This
involves calculating (migration-threshold - fail-count), and that
implies certain limitations: (1) it will only be set when the cluster
checks migration-threshold; (2) it will only be set for the failed
resource itself, not for other resources that may be recovered due to
dependencies on it.
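
To make the arithmetic concrete, here is a minimal sketch of the count
as I understand it. This is illustrative Python, not Pacemaker code;
the function name and the use of None for "no threshold configured"
are my own assumptions.

```python
from typing import Optional


def restarts_remaining(fail_count: int,
                       migration_threshold: Optional[int]) -> Optional[int]:
    """Return how many more failures the node can absorb for this resource.

    Hypothetical helper: None stands for "no threshold configured",
    in which case there is no meaningful count to pass to the agent.
    """
    if migration_threshold is None:
        # Limitation (1): nothing to report when the threshold is not checked.
        return None
    # Never report a negative count once the threshold has been reached.
    return max(0, migration_threshold - fail_count)
```

For example, with migration-threshold=3 and a fail-count of 1, two
restarts would remain. A dependent resource being restarted alongside
the failed one has no failure of its own, which is limitation (2).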

Ulrich Windl proposed an alternative: setting a boolean value instead. I
forgot to cc the list on my reply, so I'll summarize now: We would set a
new variable like OCF_RESKEY_CRM_recovery=true whenever a start is
scheduled after a stop on the same node in the same transition. This
would avoid the corner cases of the previous approach; instead of being
tied to migration-threshold, it would be set whenever a recovery was
being attempted, for any reason. And with this approach, it should be
easier to set the variable for all actions on the resource
(demote/stop/start/promote), rather than just the stop.
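
As a rough sketch of the rule and of how an agent might consume it,
under stated assumptions: the variable name is only the proposal above
and does not exist yet, the (resource, action, node) tuples are a
made-up stand-in for the actions in a transition, and both function
names are hypothetical.

```python
import os
from typing import Iterable, Mapping, Optional, Set, Tuple

# Hypothetical stand-in for one scheduled action: (resource, action, node).
Action = Tuple[str, str, str]


def recovering(actions: Iterable[Action]) -> Set[Tuple[str, str]]:
    """Return (resource, node) pairs with a start scheduled after a stop
    on the same node within one ordered list of actions."""
    stopped: Set[Tuple[str, str]] = set()
    result: Set[Tuple[str, str]] = set()
    for resource, name, node in actions:
        key = (resource, node)
        if name == "stop":
            stopped.add(key)
        elif name == "start" and key in stopped:
            result.add(key)
    return result


def in_recovery(env: Optional[Mapping[str, str]] = None) -> bool:
    """Agent-side check of the proposed variable; unset means not recovering."""
    env = os.environ if env is None else env
    return env.get("OCF_RESKEY_CRM_recovery", "false").lower() == "true"
```

Here a stop on node1 followed by a start on node1 counts as a recovery,
while a stop on node1 followed by a start on node2 (a move) does not.
Nothing in the check looks at migration-threshold or at which resource
actually failed.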

I think the boolean approach fits all the envisioned use cases that have
been discussed. Any objections to going that route instead of the count?
-- 
Ken Gaillot <kgaillot at redhat.com>