[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Thu May 19 12:43:55 EDT 2016

On Thu, 19 May 2016 10:53:31 -0500,
Ken Gaillot <kgaillot at redhat.com> wrote:

> A recent thread discussed a proposed new feature, a new environment
> variable that would be passed to resource agents, indicating whether a
> stop action was part of a recovery.
> 
> Since that thread was long and covered a lot of topics, I'm starting a
> new one to focus on the core issue remaining:
> 
> The original idea was to pass the number of restarts remaining before
> the resource will no longer be started on the same node. This
> involves calculating (fail-count - migration-threshold), and that
> implies certain limitations: (1) it will only be set when the cluster
> checks migration-threshold; (2) it will only be set for the failed
> resource itself, not for other resources that may be recovered due to
> dependencies on it.
> 
> Ulrich Windl proposed an alternative: setting a boolean value instead. I
> forgot to cc the list on my reply, so I'll summarize now: We would set a
> new variable like OCF_RESKEY_CRM_recovery=true whenever a start is
> scheduled after a stop on the same node in the same transition. This
> would avoid the corner cases of the previous approach; instead of being
> tied to migration-threshold, it would be set whenever a recovery was
> being attempted, for any reason. And with this approach, it should be
> easier to set the variable for all actions on the resource
> (demote/stop/start/promote), rather than just the stop.

I can see the value of having such a variable during various actions. However,
we can already deduce that a transition is a recovery during the notify actions,
using the notify variables (the only information we lack is the order of the
actions). A more flexible approach would be to make sure the notify variables
are always available during the whole transition for **all** actions, not just
notify. This already seems to be the case, but a recent discussion emphasized
that it is just a side effect of the current implementation. I understand this
to mean they were sometimes available outside of notifications "by accident".
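
As a rough sketch of that deduction (written in Python only for brevity; a
real RA would do this in shell or Perl), assuming the usual
OCF_RESKEY_CRM_meta_notify_stop_uname and OCF_RESKEY_CRM_meta_notify_start_uname
variables are present in the environment. The function name is hypothetical:

```python
import os

PREFIX = "OCF_RESKEY_CRM_meta_notify_"

def is_local_recovery(env, node):
    """Return True if `node` is scheduled to both stop and start the
    resource in the current transition (assumed here to mean a recovery)."""
    # Notify variables are space-separated lists of node names.
    stopping = set(env.get(PREFIX + "stop_uname", "").split())
    starting = set(env.get(PREFIX + "start_uname", "").split())
    return node in stopping and node in starting

if __name__ == "__main__":
    # Inside an RA, the variables would come from the real environment.
    print(is_local_recovery(os.environ, os.uname().nodename))
```

Note that this says nothing about the order of the actions, which is exactly
the information the notify variables do not carry today.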

Also, I can see the benefit of knowing the number of attempts remaining for the
current action before hitting the migration-threshold. I might be
misunderstanding something here, but it seems to me these are two different
pieces of information.

Basically, what we need is a better understanding of the transition itself
from within the RA actions.

If you are still brainstorming on this, here is what I would suggest, as an RA
developer:

  * provide and enforce the notify variables in all actions
  * add the order of the actions in the current transition to these variables,
    using e.g. OCF_RESKEY_CRM_meta_notify_*_actionid
  * add a new variable holding the number of action attempts remaining before
    migration. This one has the advantage of surviving the transition being
    aborted when a failure occurs.
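
To make the last item concrete, here is a minimal sketch of the value I have
in mind, computed as migration-threshold minus fail-count. The function name
is hypothetical, and treating a threshold of 0 as "no limit" is my assumption
about how a disabled migration-threshold should be reported:

```python
def remaining_attempts(migration_threshold, fail_count):
    """Number of failures still allowed on this node before the resource
    is moved away, or None when no limit applies (assumption: a threshold
    of 0 or less means the feature is disabled)."""
    if migration_threshold <= 0:
        return None
    # Never report a negative number of remaining attempts.
    return max(migration_threshold - fail_count, 0)

if __name__ == "__main__":
    print(remaining_attempts(3, 1))
```

With a threshold of 3 and one recorded failure, two attempts remain; once
the value reaches 0, the RA knows the next failure will move the resource.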

As a second step, we would be able to provide some helper functions in
ocf-shellfuncs (and in my equivalent Perl module) to compute whether the
transition is a switchover, a failover, a recovery, etc., based on the notify
variables.
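
Such a helper could look like the sketch below (in Python only for brevity).
The function name and the definitions of each scenario are my own
assumptions, not an existing API: a recovery is a node stopping and starting
(or demoting and promoting) the resource in the same transition, a
switchover is a demote on one node with a promote on another, and a failover
is a promote with no demote scheduled:

```python
PREFIX = "OCF_RESKEY_CRM_meta_notify_"

def _nodes(env, key):
    # Notify variables are space-separated lists of node names.
    return set(env.get(PREFIX + key, "").split())

def classify_transition(env):
    """Guess the kind of transition from the notify variables."""
    stop = _nodes(env, "stop_uname")
    start = _nodes(env, "start_uname")
    demote = _nodes(env, "demote_uname")
    promote = _nodes(env, "promote_uname")

    # Same node on both sides of a stop/start or demote/promote pair.
    if (stop & start) or (demote & promote):
        return "recovery"
    # The master role moves from one node to a different one.
    if demote and promote:
        return "switchover"
    # A promotion with nothing left to demote.
    if promote:
        return "failover"
    return "other"
```

For example, a transition with demote_uname set to "node1" and promote_uname
set to "node2" would be reported as a switchover.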

Presently, I am detecting such scenarios directly in my RA during the notify
actions and tracking them as private attributes, so that I am aware of the
situation during the real actions (demote and stop). See:

https://github.com/dalibo/PAF/blob/952cb3cf2f03aad18fbeafe3a91f997a56c3b606/script/pgsqlms#L95

Regards,