[ClusterLabs Developers] Why returning OCF_ERR_GENERIC during demote if resource stopped?

Mon Jun 27 02:33:05 EDT 2016

> On 16 May 2016, at 8:55 PM, Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote:
> 
> On Mon, 16 May 2016 13:15:11 +1000,
> Andrew Beekhof <andrew at beekhof.net> wrote:
> 
>> 
>>> On 28 Apr 2016, at 7:26 PM, Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
>>> wrote:
>>> 
>>> Hello all,
>>> 
>>> According to the developer's guide, when demote is called on a stopped
>>> resource, the RA should return a soft error:
>>> 
>>> http://www.linux-ha.org/doc/dev-guides/_literal_demote_literal_action.html
>>> 
>>> «
>>> foobar_monitor
>>> rc=$?
>>> case "$rc" in
>>> [...]
>>>     "$OCF_NOT_RUNNING")
>>>         # Currently not running. Getting a demote action
>>>         # in this state is unexpected. Exit with an error
>>>         # and let the cluster manager recover.
>>>         ocf_log err "Resource is currently not running"
>>>         exit $OCF_ERR_GENERIC
>>>         ;;
>>> [...]
>>> »
>>> 
>>> But to recover a master resource that is found not running, PEngine produces
>>> a transition with the following actions: demote -> stop -> start -> promote.
>>> 
>>> If we follow the dev guide, the recover action is not possible on a
>>> stopped master as the first action of the transition will always fail,
>>> leading to a migration and a -inf score on the old master node.
>>> 
>>> My first thought was «why do a demote -> stop, which breaks everything, when
>>> it knows the resource is already stopped?!»
>>> 
>>> If I understand correctly, I guess PEngine **must** produce such a
>>> transition so that the notify actions are triggered, should other live
>>> clones need to process them. Is that right?
>> 
>> Yes, also because in theory there could be some cleanup that needs to happen.
>> 
>>> If this is right, then maybe we should relax a bit what is
>>> written in the ocf dev guide?
>> 
>> I would change that block to use
>> 
>> exit $OCF_NOT_RUNNING
>> 
>> Because we don’t know for sure that the stop will happen.
> 
> I suppose returning OCF_NOT_RUNNING from the demote action would break the
> current transition, as the CRM is expecting an OCF_SUCCESS, wouldn't it?

Same as returning $OCF_ERR_GENERIC, yes.
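
For reference, the change suggested above for the demote block could look roughly like the sketch below. It is only an illustration: foobar_monitor here is a stand-in that fakes the resource state through a made-up FOOBAR_STATE variable, and the functions use return instead of exit so they can be sourced and tried out. The numeric values are the standard OCF return codes that ocf-shellfuncs normally provides.

```shell
#!/bin/sh
# Standard OCF return codes (normally provided by ocf-shellfuncs).
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"
: "${OCF_RUNNING_MASTER:=8}"

# Hypothetical stand-in for the agent's real monitor action.
# FOOBAR_STATE is an illustration-only variable simulating the resource.
foobar_monitor() {
    case "${FOOBAR_STATE:-stopped}" in
        master) return "$OCF_RUNNING_MASTER" ;;
        slave)  return "$OCF_SUCCESS" ;;
        *)      return "$OCF_NOT_RUNNING" ;;
    esac
}

foobar_demote() {
    mon_rc=0
    foobar_monitor || mon_rc=$?
    case "$mon_rc" in
        "$OCF_RUNNING_MASTER")
            # Running as master: the real demotion work would go here.
            FOOBAR_STATE=slave
            return "$OCF_SUCCESS"
            ;;
        "$OCF_SUCCESS")
            # Already running as a slave: nothing to do.
            return "$OCF_SUCCESS"
            ;;
        "$OCF_NOT_RUNNING")
            # Not running. Report that state as-is instead of a generic
            # error; the demote still counts as failed for the cluster.
            echo "Resource is currently not running" >&2
            return "$OCF_NOT_RUNNING"
            ;;
        *)
            # Any other monitor result is treated as a failure.
            return "$OCF_ERR_GENERIC"
            ;;
    esac
}
```

Either way the demote does not return $OCF_SUCCESS, so the transition is still interrupted; the only difference is which state gets reported back.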

> Or does the
> CRM conclude it does not need to run the next stop action?

I forget what the current semantics are; the PE may indeed decide not to schedule a stop action when it recomputes.

> 
> I am worried about breaking a transition, as we rely on notify vars to detect
> the recovery of a slave, the recovery of a master, or a master move.

You can’t avoid it, unless you lie and return $OCF_SUCCESS.

> 
> For a master or a slave recovery, we need to run some cleanup actions on the
> PostgreSQL side.

That would be an argument to change the monitor action to return OCF_ERR_GENERIC if postgres isn’t running BUT cleanup IS needed and reserve OCF_NOT_RUNNING for when everything is cleanly stopped.
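
A rough sketch of what such a monitor could look like is below. The two checks are placeholders I made up for illustration: a PID file test for "is it running" and a marker file for "cleanup is needed", with hypothetical default paths. A real agent would ask PostgreSQL itself, and would also have to report the master role, which is left out here.

```shell
#!/bin/sh
# Standard OCF return codes (normally provided by ocf-shellfuncs).
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"

# Hypothetical paths, for illustration only.
: "${PG_PIDFILE:=/tmp/pgsql-sketch/postmaster.pid}"
: "${PG_DIRTY_MARKER:=/tmp/pgsql-sketch/needs_cleanup}"

# Placeholder check: a PID file exists and its process is alive.
pg_is_running() {
    [ -f "$PG_PIDFILE" ] && kill -0 "$(cat "$PG_PIDFILE")" 2>/dev/null
}

# Placeholder check: something was left behind by an unclean shutdown.
pg_needs_cleanup() {
    [ -e "$PG_DIRTY_MARKER" ]
}

pgsql_monitor() {
    if pg_is_running; then
        # Role detection (master vs slave) is omitted from this sketch.
        return "$OCF_SUCCESS"
    fi
    if pg_needs_cleanup; then
        # Not running, but not cleanly stopped either: report a failure
        # so that recovery (and the cleanup) gets scheduled.
        echo "PostgreSQL is not running and cleanup is needed" >&2
        return "$OCF_ERR_GENERIC"
    fi
    # Not running and nothing left to clean up.
    return "$OCF_NOT_RUNNING"
}
```

The point is only the split between the two "not running" cases: $OCF_NOT_RUNNING is kept for a clean stop, and anything that still needs attention is reported as an error.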

> If we break the original transition, the new transition
> **might** (if the new transition is actually different) look like a normal
> master start->promote.

Not possible. There will be a failed action in there so it won’t look normal.

> 
> Regards,
