[ClusterLabs Developers] Why returning OCF_ERR_GENERIC during demote if resource stopped?

Jehan-Guillaume de Rorthais jgdr at dalibo.com
Mon May 16 10:55:58 UTC 2016


On Mon, 16 May 2016 13:15:11 +1000,
Andrew Beekhof <andrew at beekhof.net> wrote:

> 
> > On 28 Apr 2016, at 7:26 PM, Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> > wrote:
> > 
> > Hello all,
> > 
> > According to the developer's guide, when demote is called on a stopped
> > resource, the RA should return a soft error:
> > 
> > http://www.linux-ha.org/doc/dev-guides/_literal_demote_literal_action.html
> > 
> >  «
> >  foobar_monitor
> >  rc=$?
> >  case "$rc" in
> >  [...]
> >      "$OCF_NOT_RUNNING")
> >          # Currently not running. Getting a demote action
> >          # in this state is unexpected. Exit with an error
> >          # and let the cluster manager recover.
> >          ocf_log err "Resource is currently not running"
> >          exit $OCF_ERR_GENERIC
> >          ;;
> >  [...]
> >  »
> > 
> > But to recover a master resource that is found not running, PEngine produces
> > a transition with the following actions: demote -> stop -> start -> promote.
> > 
> > If we follow the dev guide, the recover action is not possible on a
> > stopped master as the first action of the transition will always fail,
> > leading to a migration and a -inf score on the old master node.
> > 
> > My first thought was «why do a demote -> stop that breaks everything when
> > it knows the resource is already stopped?!»
> > 
> > If I understand correctly, I guess PEngine **must** produce such a
> > transition so the notify actions are triggered, should other living clones
> > need to process them. Is that right?
> 
> Yes, also because in theory there could be some cleanup that needs to happen.
> 
> > If this is right, then maybe we should relax a bit what is
> > written in the ocf dev guide?
> 
> I would change that block to use
> 
> exit $OCF_NOT_RUNNING
> 
> Because we don’t know for sure that the stop will happen

I suppose returning OCF_NOT_RUNNING from the demote action would break the
current transition, as the CRM is expecting OCF_SUCCESS, wouldn't it? Or does
the CRM conclude it does not need to run the next stop action?
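
To make the question concrete, here is a minimal sketch of a demote action
following the change suggested above. It is only an illustration: the
foobar_monitor stub and the FAKE_MONITOR_RC variable are hypothetical, the
return code defaults stand in for what ocf-shellfuncs normally provides, and
the function uses "return" rather than "exit" so it can be exercised outside
the cluster. It says nothing about how the CRM reacts to the return code.

```shell
#!/bin/sh
# OCF return codes. A real agent gets these from ocf-shellfuncs; the
# defaults below only keep this sketch self-contained.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"
: "${OCF_RUNNING_MASTER:=8}"

# Hypothetical stand-in for the agent's real monitor action. It returns
# whatever FAKE_MONITOR_RC holds, or OCF_NOT_RUNNING when it is unset.
foobar_monitor() {
    return "${FAKE_MONITOR_RC:-$OCF_NOT_RUNNING}"
}

foobar_demote() {
    foobar_monitor
    rc=$?
    case "$rc" in
        "$OCF_RUNNING_MASTER")
            # Running as master: the real demote work would go here.
            return "$OCF_SUCCESS"
            ;;
        "$OCF_SUCCESS")
            # Already running as a slave: nothing to do.
            return "$OCF_SUCCESS"
            ;;
        "$OCF_NOT_RUNNING")
            # Not running: report the state instead of a generic error.
            return "$OCF_NOT_RUNNING"
            ;;
        *)
            # Any other monitor result is treated as a failure.
            return "$OCF_ERR_GENERIC"
            ;;
    esac
}
```

With the stub reporting a stopped resource, foobar_demote returns
OCF_NOT_RUNNING instead of OCF_ERR_GENERIC.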

I am worried about breaking a transition, as we rely on the notify vars to
detect the recovery of a slave or a master, or a master move.

For a master or a slave recovery, we need to run some cleanup actions on the
PostgreSQL side. If we break the original transition, the new transition
**might** (if the new transition is actually different) look like a normal
master start->promote.
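
As an illustration of the kind of detection I mean, here is a rough sketch
built on the notification environment variables Pacemaker passes to notify
actions. The rule itself is an assumption made for this example: a node that
is listed both as a current master and as a node to promote is treated as a
master recovery, and likewise slave/start for a slave recovery. The function
names are hypothetical.

```shell
#!/bin/sh
# Succeeds if word $1 appears in the space-separated list $2.
in_list() {
    [ -n "$1" ] || return 1
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumed rule: node $1 is a master before the transition and is also
# scheduled for promotion in the same transition.
is_master_recover() {
    in_list "$1" "${OCF_RESKEY_CRM_meta_notify_master_uname:-}" &&
    in_list "$1" "${OCF_RESKEY_CRM_meta_notify_promote_uname:-}"
}

# Assumed rule: node $1 is a slave before the transition and is also
# scheduled to be started in the same transition.
is_slave_recover() {
    in_list "$1" "${OCF_RESKEY_CRM_meta_notify_slave_uname:-}" &&
    in_list "$1" "${OCF_RESKEY_CRM_meta_notify_start_uname:-}"
}
```

If the original transition is aborted and replaced by a plain start->promote,
the master list no longer contains the node, so a check like
is_master_recover would not fire.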

Regards,



