[ClusterLabs Developers] Notify actions

Tue Feb 18 11:11:05 EST 2020

On Thu, 2020-02-13 at 00:15 +0100, Jehan-Guillaume de Rorthais wrote:
> On Wed, 12 Feb 2020 15:17:01 -0600
> Ken Gaillot <kgaillot at redhat.com> wrote:
> 
> > On Wed, 2020-02-12 at 16:03 +0100, Jehan-Guillaume de Rorthais wrote:
> > > Hello devs,
> > > 
> > > I have a few questions about the notify action.
> > > 
> > > 1. notify clone option
> > > 
> > > Why does a clone option exist to enable or disable them, and why is it
> > > false by default?
> > 
> > Most likely to preserve backward-compatible behavior when notifications
> > were first implemented.
> 
> ok
> 
> > > As notify is an action, I suppose it should be enabled by default if
> > > the RA claims to support it in its meta-data action. There is no need
> > > for a clone option for this.
> > > 
> > > Moreover, if one needs to deactivate it for some reason, I suppose the
> > > proper way to do it would be to set "enabled=false" as the notify
> > > operation property for the given resource.
> > 
> > That would have been a better design. :) I'm not sure whether that
> > would work currently.
> 
> Why?

I'm pretty sure the code only checks "enabled" for recurring operations
at the moment, i.e. you can't disable start or stop.
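
To make that concrete, here is a toy model in Python of the behavior
described above. It is not Pacemaker code: the Operation class and the
is_effectively_disabled() helper are hypothetical names, and the rule it
encodes is only the guess stated above (the flag is honored for
recurring operations and ignored otherwise).

```python
from dataclasses import dataclass


@dataclass
class Operation:
    """Hypothetical stand-in for a resource operation definition."""
    name: str
    interval_ms: int = 0   # 0 means a one-shot (non-recurring) operation
    enabled: bool = True   # value of the operation's "enabled" property


def is_effectively_disabled(op: Operation) -> bool:
    # Assumption from the discussion: the flag is only consulted for
    # recurring operations, so one-shot actions such as start, stop or
    # notify run regardless of enabled=false.
    return op.interval_ms > 0 and not op.enabled


if __name__ == "__main__":
    for op in (Operation("monitor", 10_000, enabled=False),
               Operation("notify", 0, enabled=False)):
        print(op.name, "disabled" if is_effectively_disabled(op) else "still runs")
```

Under this model, setting the property on a notify operation would
change nothing, which is why the clone-level notify option is still the
only switch.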

> 
> [...]
> > > 2. return code
> > > 
> > > Why is the return code from notify ignored by the cluster?
> > > 
> > > As discussed on IRC and by email (IIRC), OCF_RESKEY_CRM_meta_notify_*
> > > are available during the notify action. This information is useful for
> > > clones or promotable resources to detect some wrong actions and raise
> > > an error so the cluster tries another transition.
> > 
> > I'm not sure, but I'm guessing one issue is that notifications are
> > called before and after stop. If a stop-related notify failed, the node
> > would likely have to be fenced, just as if the stop itself failed.
> 
> Notify is not called after a stop on the node where the resource has been
> stopped. If

That brings up another point: notifications are run on all nodes with
the clone.

If a notify fails, should it be considered a failure of the resource
(raising the fail count etc.), or just a failure of the transition
step?

If it should be considered a resource failure, is it a failure on the
node where the real action is taking place, or where the notify is
taking place? If it's where the notify is taking place, that could lead
to attempted recovery on that node, which is likely to be completely
unrelated to why a notification would fail.

If it should be considered only a transition failure, what's to prevent
an infinite loop? The cluster would reschedule the same actions.
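
The distinction between "where the real action is taking place" and
"where the notify is taking place" can be seen from inside an agent. The
sketch below is a minimal Python notify handler, not taken from any
existing resource agent. It reads OCF_RESKEY_CRM_meta_notify_type,
OCF_RESKEY_CRM_meta_notify_operation and the matching
OCF_RESKEY_CRM_meta_notify_<operation>_uname list. The function names
are hypothetical, and using socket.gethostname() for the local node name
is an assumption (the cluster node name may differ from the hostname).

```python
import os
import socket
import sys

OCF_SUCCESS = 0


def describe_notification(env, local_node):
    """Summarize a notification from the notify_* environment variables."""
    n_type = env.get("OCF_RESKEY_CRM_meta_notify_type", "")      # pre / post
    n_op = env.get("OCF_RESKEY_CRM_meta_notify_operation", "")   # start, stop, ...
    # Nodes where the real action (e.g. the stop) is taking place.
    targets = env.get("OCF_RESKEY_CRM_meta_notify_%s_uname" % n_op, "").split()
    return {
        "type": n_type,
        "operation": n_op,
        "targets": targets,
        "local_is_target": local_node in targets,
    }


def notify(env=os.environ, local_node=None):
    """Hypothetical notify action: log what happened, always succeed."""
    if local_node is None:
        local_node = socket.gethostname()  # assumption, see lead-in
    info = describe_notification(env, local_node)
    print("%s-%s on %s, received on %s" % (
        info["type"], info["operation"],
        ",".join(info["targets"]) or "(none)", local_node), file=sys.stderr)
    # The cluster currently ignores this return code, so a handler that
    # detects a problem here has no direct way to fail the transition.
    return OCF_SUCCESS


if __name__ == "__main__":
    sys.exit(notify())
```

When a handler running on node2 receives a post-stop notification for a
stop that happened on node1, local_is_target is false. That is exactly
the case where recovering the resource on node2 would be unrelated to
whatever made the notification fail.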

> the stop succeeds, no notify is called on this node. If the stop failed,
> well, fencing anyway, no time for notify.
> 
> Same for start: AFAIR notify is not called for pre-start, only post-start.
> 
-- 
Ken Gaillot <kgaillot at redhat.com>