[ClusterLabs Developers] Why returning OCF_ERR_GENERIC during demote if resource stopped?

Andrew Beekhof andrew at beekhof.net
Mon May 16 03:15:11 UTC 2016


> On 28 Apr 2016, at 7:26 PM, Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote:
> 
> Hello all,
> 
> According to the developers guide, when calling demote on a stopped resource,
> the RA should return a soft error:
> 
> http://www.linux-ha.org/doc/dev-guides/_literal_demote_literal_action.html
> 
>  «
>  foobar_monitor
>  rc=$?
>  case "$rc" in
>  [...]
>      "$OCF_NOT_RUNNING")
>          # Currently not running. Getting a demote action
>          # in this state is unexpected. Exit with an error
>          # and let the cluster manager recover.
>          ocf_log err "Resource is currently not running"
>          exit $OCF_ERR_GENERIC
>          ;;
>  [...]
>  »
> 
> But to recover a master resource that is found not running, PEngine produces a
> transition with the following actions: demote -> stop -> start -> promote.
> 
> If we follow the dev guide, recovery is not possible on a stopped master, as
> the first action of the transition will always fail, leading to a migration
> and a -inf score on the old master node.
> 
> My first thought was «why do a demote -> stop that breaks everything when it
> knows the resource is already stopped?!»
> 
> If I understand correctly, I guess PEngine **must** produce such a transition
> so the notify actions are triggered, should other living clones need to
> process them. Is that right?

Yes, also because in theory there could be some cleanup that needs to happen.

> If this is right, then maybe we should relax a bit what is
> written in the ocf dev guide?

I would change that block to use

exit $OCF_NOT_RUNNING

Because we don’t know for sure that the stop will happen.
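
A minimal sketch of what that change could look like inside a demote function, assuming the structure of the dev guide example quoted above. foobar_monitor and FOOBAR_STATE are hypothetical stand-ins for a real monitor, the numeric codes are the standard OCF ones normally provided by ocf-shellfuncs, the "info" log level is my own choice, and the function uses return instead of exit only so it can be exercised outside the cluster:

```shell
#!/bin/sh
# Standard OCF exit codes; normally provided by ocf-shellfuncs.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"
: "${OCF_RUNNING_MASTER:=8}"

# Stand-in for ocf_log when ocf-shellfuncs has not been sourced.
command -v ocf_log >/dev/null 2>&1 || ocf_log() { echo "$@" >&2; }

# Hypothetical monitor: FOOBAR_STATE simulates the resource state.
foobar_monitor() {
    case "${FOOBAR_STATE:-stopped}" in
        master) return "$OCF_RUNNING_MASTER" ;;
        slave)  return "$OCF_SUCCESS" ;;
        *)      return "$OCF_NOT_RUNNING" ;;
    esac
}

foobar_demote() {
    foobar_monitor
    rc=$?
    case "$rc" in
        "$OCF_RUNNING_MASTER")
            # Running as master: the real demotion would happen here.
            FOOBAR_STATE=slave
            return "$OCF_SUCCESS"
            ;;
        "$OCF_SUCCESS")
            # Already running as a slave: nothing to do.
            return "$OCF_SUCCESS"
            ;;
        "$OCF_NOT_RUNNING")
            # Not running: report the real state instead of a generic
            # error, since the following stop is not guaranteed to run.
            ocf_log info "Resource is currently not running"
            return "$OCF_NOT_RUNNING"
            ;;
        *)
            ocf_log err "Unexpected monitor result: $rc"
            return "$OCF_ERR_GENERIC"
            ;;
    esac
}
```

In a real agent the function would call exit (or the action dispatcher would exit with its return value), and the stubs would be replaced by the agent's own monitor logic.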

> 
> To be able to deal with this in our RA, if the resource is stopped during the
> demote action, we silently start it as a slave and return OCF_ERR_GENERIC if we
> couldn't start the resource. We return OCF_SUCCESS if it succeeds (I guess we
> could just return OCF_SUCCESS without starting it if the transition plans to
> stop it according to the notify variables).
> 
> Comments? Advice?
> 
> Regards,
> -- 
> Jehan-Guillaume de Rorthais
> Dalibo
> 




