[ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

Thu May 19 20:15:20 CEST 2016

On 05/19/2016 11:43 AM, Jehan-Guillaume de Rorthais wrote:
> On Thu, 19 May 2016 10:53:31 -0500,
> Ken Gaillot <kgaillot at redhat.com> wrote:
> 
>> A recent thread discussed a proposed new feature, a new environment
>> variable that would be passed to resource agents, indicating whether a
>> stop action was part of a recovery.
>>
>> Since that thread was long and covered a lot of topics, I'm starting a
>> new one to focus on the core issue remaining:
>>
>> The original idea was to pass the number of restarts remaining before
>> the resource will no longer be started on the same node. This
>> involves calculating (migration-threshold - fail-count), and that
>> implies certain limitations: (1) it will only be set when the cluster
>> checks migration-threshold; (2) it will only be set for the failed
>> resource itself, not for other resources that may be recovered due to
>> dependencies on it.
>>
>> Ulrich Windl proposed an alternative: setting a boolean value instead. I
>> forgot to cc the list on my reply, so I'll summarize now: We would set a
>> new variable like OCF_RESKEY_CRM_recovery=true whenever a start is
>> scheduled after a stop on the same node in the same transition. This
>> would avoid the corner cases of the previous approach; instead of being
>> tied to migration-threshold, it would be set whenever a recovery was
>> being attempted, for any reason. And with this approach, it should be
>> easier to set the variable for all actions on the resource
>> (demote/stop/start/promote), rather than just the stop.
> 
> I can see the value of having such a variable during various actions. However, we
> can also deduce that the transition is a recovery during the notify actions from
> the notify variables (the only information we lack is the order of the
> actions). A more flexible approach would be to make sure the notify variables
> are always available during the whole transition for **all** actions, not just
> notify. It seems that this is already the case, but a recent discussion emphasized
> that this is just a side effect of the current implementation. I understand this to
> mean that they were sometimes available outside of notifications "by accident".

It does seem that a recovery could be implied from the
notify_{start,stop}_uname variables, but notify variables are only set
for clones that support the notify action. I think the goal here is to
work with any resource type. Even for clones, if they don't otherwise
need notifications, they'd have to add the overhead of notify calls on
all instances, which would do nothing.
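
For clones that do receive notifications, the inference could look
roughly like the sketch below. It is Python for illustration only; the
helper name recovering_nodes is made up, and it assumes the two
variables hold space-separated node names. It also ignores which
resource instance goes with which node, so treat it as a rough
approximation rather than a reliable test.

```python
def recovering_nodes(env):
    """Return nodes listed in both the stop and the start notify variables.

    `env` is a mapping such as os.environ. Hypothetical helper: a node
    that is stopped and started again in the same transition is taken
    to be recovering.
    """
    stop = env.get("OCF_RESKEY_CRM_meta_notify_stop_uname", "").split()
    start = env.get("OCF_RESKEY_CRM_meta_notify_start_uname", "").split()
    # Preserve the order of the stop list in the result.
    return [node for node in stop if node in start]
```

An agent would call recovering_nodes(os.environ) from its notify
action; outside of notify the variables may simply be missing, in which
case the result is an empty list.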

> Also, I can see the benefit of having the remaining attempts for the current
> action before hitting the migration-threshold. I might misunderstand something
> here, but it seems to me the two pieces of information are different.

I think the use cases that have been mentioned would all be happy with
just the boolean. Does anyone need the actual count, or just whether
this is a stop-start vs a full stop?
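
To make the boolean case concrete, here is a minimal sketch of how an
agent might branch on it, written in Python for illustration.
OCF_RESKEY_CRM_recovery is only the proposed variable name and does not
exist today; the function names and the "true" string comparison are
assumptions as well.

```python
import os


def is_recovery(env=None):
    """True if the (proposed, not yet existing) recovery variable is set."""
    env = os.environ if env is None else env
    return env.get("OCF_RESKEY_CRM_recovery", "false").lower() == "true"


def stop_mode(env=None):
    """Tell the stop action which kind of stop it is performing."""
    # A stop followed by a start on the same node vs. a plain stop.
    return "stop-start" if is_recovery(env) else "full-stop"
```

A stop action could then skip expensive cleanup when stop_mode()
returns "stop-start" and do the full teardown otherwise.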

The problem with the migration-threshold approach is that there are
recoveries that will be missed because they don't involve
migration-threshold. If the count is really needed, the
migration-threshold approach is necessary, but if recovery is what is
really of interest, then a boolean would be more accurate.
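
As a rough illustration of that gap, the count could be computed as
below, taking "remaining" to mean migration-threshold minus fail-count.
This is a Python sketch, not Pacemaker code; the function name and the
use of None for "the cluster did not check migration-threshold" are
assumptions made for the example.

```python
def remaining_restarts(migration_threshold, fail_count):
    """Restarts left on this node, or None when no threshold was checked.

    Hypothetical helper: None models a recovery that does not involve
    migration-threshold (for example, a restart caused by a dependency).
    """
    if migration_threshold is None:
        return None
    # Never report a negative number of remaining restarts.
    return max(0, migration_threshold - fail_count)
```

A resource restarted only because something it depends on failed would
get None here, even though it is being recovered; the boolean would
still be true in that case.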

> Basically, what we need is a better understanding of the transition itself
> from the RA actions.
> 
> If you are still brainstorming on this, as an RA dev, what I would
> suggest is:
> 
>   * provide and enforce the notify variables in all actions
>   * add the order of the actions during the current transition to these
>     variables using e.g. OCF_RESKEY_CRM_meta_notify_*_actionid

The action ID would be different for each node being acted on, so it
would be more complicated (maybe *_actions="NODE1:ID1,NODE2:ID2,..."?).
Also, RA writers would need to be aware that some actions may be
initiated in parallel. Probably more complex than it's worth.
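
Just to show what agents would have to do with such a value, a parser
for the "NODE:ID" list format floated above might look like this in
Python. The format is purely hypothetical, as is the function name
parse_actions.

```python
def parse_actions(value):
    """Parse a hypothetical "NODE1:ID1,NODE2:ID2" string into a dict."""
    actions = {}
    for item in value.split(","):
        if not item:
            continue  # tolerate an empty string or a trailing comma
        node, _, action_id = item.partition(":")
        actions[node] = action_id
    return actions
```

Even with the IDs parsed, an agent could not assume that a lower ID
finished before a higher one, since some actions may run in parallel.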

>   * add a new variable with the remaining action attempts before migration.
>     This one has the advantage of surviving the transition breakage when a
>     failure occurs.
> 
> As a second step, we would be able to provide some helper functions in
> ocf_shellfuncs (and in my Perl module equivalent) to compute whether the
> transition is a switchover, a failover, a recovery, etc., based on the notify
> variables.
> 
> Presently, I am detecting such scenarios directly in my RA during the notify
> actions and tracking them as private attributes to be aware of the situation 
> during the real actions (demote and stop). See:
> 
> https://github.com/dalibo/PAF/blob/952cb3cf2f03aad18fbeafe3a91f997a56c3b606/script/pgsqlms#L95
> 
> Regards,
>