[ClusterLabs Developers] Why returning OCF_ERR_GENERIC during demote if resource stopped?
andrew at beekhof.net
Mon Jun 27 02:33:05 EDT 2016
> On 16 May 2016, at 8:55 PM, Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote:
> On Mon, 16 May 2016 13:15:11 +1000,
> Andrew Beekhof <andrew at beekhof.net> wrote:
>>> On 28 Apr 2016, at 7:26 PM, Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote:
>>> Hello all,
>>> According to the developers guide, when demote is called on a stopped
>>> resource, the RA should return a soft error:
>>>     case "$rc" in
>>>         "$OCF_NOT_RUNNING")
>>>             # Currently not running. Getting a demote action
>>>             # in this state is unexpected. Exit with an error
>>>             # and let the cluster manager recover.
>>>             ocf_log err "Resource is currently not running"
>>>             exit $OCF_ERR_GENERIC
>>>             ;;
>>>     esac
>>> But to recover a master resource that is found not running, the PEngine produces
>>> a transition with the following actions: demote -> stop -> start -> promote.
>>> If we follow the dev guide, recovery is not possible on a
>>> stopped master, as the first action of the transition will always fail,
>>> leading to a migration and a -inf score on the old master node.
>>> My first thought was: "why do a demote -> stop that breaks everything when
>>> it knows the resource is already stopped?!"
>>> If I understand correctly, I guess the PEngine **must** produce such a
>>> transition so that the notify actions are triggered, in case other live clones
>>> need to process them. Is that right?
>> Yes, also because in theory there could be some cleanup that needs to happen.
>>> If this is right, then maybe we should relax a bit what is
>>> written in the ocf dev guide?
>> I would change that block to use
>> exit $OCF_NOT_RUNNING
>> because we don’t know for sure that the stop will happen.
> I suppose returning OCF_NOT_RUNNING from the demote action would break the
> current transition as the CRM is expecting an OCF_SUCCESS, wouldn't it?
Same as returning $OCF_ERR_GENERIC, yes.
> Or does the
> CRM conclude it does not need to run the next stop action?
I forget what the current semantics are; the PE may indeed decide not to schedule a stop action when it recomputes.
> I am worried about breaking a transition, as we rely on the notify vars to detect
> the recovery of a slave or a master, or a master move.
You can’t avoid it, unless you lie and return $OCF_SUCCESS.
> For a master or a slave recovery, we need to run some cleanup actions on the
> PostgreSQL side.
That would be an argument for changing the monitor action to return OCF_ERR_GENERIC if postgres isn’t running BUT cleanup IS needed, and reserving OCF_NOT_RUNNING for when everything is cleanly stopped.
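Something along these lines, as a minimal sketch only. `pg_is_running`, `pg_needs_cleanup`, `pg_monitor` and the `PG_RUNNING` / `PG_NEEDS_CLEANUP` variables are hypothetical placeholders for whatever checks the agent really performs, and the master case (OCF_RUNNING_MASTER) is left out for brevity.

```shell
#!/bin/sh
# Standard OCF return codes; a real agent gets these from ocf-shellfuncs.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"

# Hypothetical checks, driven by variables so the sketch runs on its own.
# A real agent would inspect the PostgreSQL instance here.
pg_is_running() {
    [ "${PG_RUNNING:-0}" -eq 1 ]
}
pg_needs_cleanup() {
    [ "${PG_NEEDS_CLEANUP:-0}" -eq 1 ]
}

# Hypothetical monitor action.
pg_monitor() {
    if pg_is_running; then
        return "$OCF_SUCCESS"
    fi
    if pg_needs_cleanup; then
        # Stopped, but not cleanly: report a failure so that the
        # cluster schedules a recovery.
        return "$OCF_ERR_GENERIC"
    fi
    # Stopped and nothing left to clean up.
    return "$OCF_NOT_RUNNING"
}
```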
> If we break the original transition, the new transition
> **might** (if the new transition is actually different) look like a normal
> master start->promote.
Not possible. There will be a failed action in there so it won’t look normal.