[ClusterLabs Developers] Why returning OCF_ERR_GENERIC during demote if resource stopped?
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Mon May 16 10:55:58 UTC 2016
On Mon, 16 May 2016 13:15:11 +1000,
Andrew Beekhof <andrew at beekhof.net> wrote:
>
> > On 28 Apr 2016, at 7:26 PM, Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> > wrote:
> >
> > Hello all,
> >
> > According to the developers guide, when demote is called on a stopped
> > resource, the RA should return a soft error:
> >
> > http://www.linux-ha.org/doc/dev-guides/_literal_demote_literal_action.html
> >
> > «
> > foobar_monitor
> > rc=$?
> > case "$rc" in
> >     [...]
> >     "$OCF_NOT_RUNNING")
> >         # Currently not running. Getting a demote action
> >         # in this state is unexpected. Exit with an error
> >         # and let the cluster manager recover.
> >         ocf_log err "Resource is currently not running"
> >         exit $OCF_ERR_GENERIC
> >         ;;
> >     [...]
> > »
> >
> > But to recover a master resource that is found not running, PEngine produces
> > a transition with the following actions: demote -> stop -> start -> promote.
> >
> > If we follow the dev guide, the recover action is not possible on a
> > stopped master as the first action of the transition will always fail,
> > leading to a migration and a -inf score on the old master node.
> >
> > My first thought was «why do a demote -> stop that breaks everything when
> > it knows the resource is already stopped?!»
> >
> > If I understand correctly, I guess PEngine **must** produce such a
> > transition so the notify actions are triggered in case other living clones
> > need to process them. Is that right?
>
> Yes, also because in theory there could be some cleanup that needs to happen.
>
> > If this is right, then maybe we should relax a bit what is
> > written in the ocf dev guide?
>
> I would change that block to use
>
> exit $OCF_NOT_RUNNING
>
> Because we don’t know for sure that the stop will happen.
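
To be sure we are talking about the same change, here is a rough sketch of what
the demote action would look like with that suggestion applied. It is only an
illustration: the return codes are hard-coded so the snippet runs on its own (a
real RA would source ocf-shellfuncs), foobar_monitor is a stub driven by a
made-up FOOBAR_STATE variable, and the master branch does no real work.

```sh
#!/bin/sh
# Sketch only. OCF return codes are hard-coded here so the snippet is
# standalone; a real resource agent gets them from ocf-shellfuncs.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8

# Hypothetical stand-in for the agent's monitor action. The state is
# faked through FOOBAR_STATE (an assumption made for this example only).
foobar_monitor() {
    case "${FOOBAR_STATE:-stopped}" in
        master)  return $OCF_RUNNING_MASTER ;;
        slave)   return $OCF_SUCCESS ;;
        stopped) return $OCF_NOT_RUNNING ;;
        *)       return $OCF_ERR_GENERIC ;;
    esac
}

foobar_demote() {
    foobar_monitor
    rc=$?
    case "$rc" in
        "$OCF_RUNNING_MASTER")
            # Running as master: a real agent would demote here.
            return $OCF_SUCCESS
            ;;
        "$OCF_SUCCESS")
            # Already running as a slave: nothing to do.
            return $OCF_SUCCESS
            ;;
        "$OCF_NOT_RUNNING")
            # Suggested change: report the observed state instead of
            # exiting with OCF_ERR_GENERIC.
            echo "Resource is currently not running" >&2
            return $OCF_NOT_RUNNING
            ;;
        *)
            # Any other monitor result is treated as a failure.
            return $OCF_ERR_GENERIC
            ;;
    esac
}
```

With FOOBAR_STATE=stopped, foobar_demote returns OCF_NOT_RUNNING (7) rather than
OCF_ERR_GENERIC (1). How the CRM reacts to that return code is exactly what I am
asking about.
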
I suppose returning OCF_NOT_RUNNING from the demote action would break the
current transition, as the CRM is expecting OCF_SUCCESS, wouldn't it? Or does the
CRM conclude that it does not need to run the next stop action?

I am worried about breaking a transition because we rely on the notify variables
to detect the recovery of a slave, the recovery of a master, or a master move.

For a master or a slave recovery, we need to run some cleanup actions on the
PostgreSQL side. If we break the original transition, the new transition
**might** (if it is actually different) look like a normal master
start->promote.
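
To illustrate the kind of check I mean, here is a simplified sketch, not our
actual agent code. The helper names node_in_list and is_master_recover are made
up for this example, and the rule itself is an assumption: a master recovery
shows up as the same node being listed as a current master and in both the
demote and promote lists of one transition. The
OCF_RESKEY_CRM_meta_notify_*_uname variables are the space-separated node lists
Pacemaker passes to notify actions.

```sh
#!/bin/sh
# Sketch only: helper names are hypothetical, and the detection rule is
# an assumption made for illustration.

# Return 0 if node $1 appears in the space-separated list $2.
node_in_list() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Assumed rule: node $1 is being recovered as a master when it is listed
# as a current master AND the same transition demotes and promotes it.
is_master_recover() {
    node_in_list "$1" "${OCF_RESKEY_CRM_meta_notify_master_uname:-}" &&
    node_in_list "$1" "${OCF_RESKEY_CRM_meta_notify_demote_uname:-}" &&
    node_in_list "$1" "${OCF_RESKEY_CRM_meta_notify_promote_uname:-}"
}
```

If the transition is recomputed as a plain start->promote, the node is only in
the promote list and a check like this no longer matches, so the cleanup would
be skipped.
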
Regards,