[ClusterLabs Developers] Why returning OCF_ERR_GENERIC during demote if resource stopped?

Jehan-Guillaume de Rorthais jgdr at dalibo.com
Thu Apr 28 09:26:19 UTC 2016


Hello all,

According to the developer guide, when demote is called on a stopped resource,
the RA should return a soft error:

http://www.linux-ha.org/doc/dev-guides/_literal_demote_literal_action.html

  «
  foobar_monitor
  rc=$?
  case "$rc" in
  [...]
      "$OCF_NOT_RUNNING")
          # Currently not running. Getting a demote action
          # in this state is unexpected. Exit with an error
          # and let the cluster manager recover.
          ocf_log err "Resource is currently not running"
          exit $OCF_ERR_GENERIC
          ;;
  [...]
  »

But to recover a master resource that is found not running, PEngine produces a
transition with the following actions: demote -> stop -> start -> promote.

If we follow the dev guide, recovery is not possible on a stopped master, as
the first action of the transition (demote) will always fail, leading to a
migration and a -inf score on the old master node.

My first thought was «why do a demote -> stop that breaks everything when it
knows the resource is already stopped?!»

If I understand correctly, I guess PEngine **must** produce such a transition
so that the notify actions are triggered, should other living clones need to
process them. Is that right? If so, then maybe we should relax a bit what is
written in the OCF dev guide?

To deal with this in our RA, if the resource is found stopped during the
demote action, we silently start it as a slave and return OCF_ERR_GENERIC if we
couldn't start the resource. We return OCF_SUCCESS if it succeeds (I guess we
could just return OCF_SUCCESS without starting it if the transition plans to
stop it according to the notify variables).
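
The workaround above can be sketched roughly as follows. This is only an
illustration, not our actual RA code: the foobar_* functions and the
FOOBAR_STATE, FOOBAR_START_OK and FOOBAR_NODE variables are hypothetical
stand-ins so that the snippet runs outside a cluster, and the check on
OCF_RESKEY_CRM_meta_notify_stop_uname assumes the notify variables are
exported to the demote action, which I have not verified here. The OCF_*
return codes are the standard values normally provided by ocf-shellfuncs.

```shell
#!/bin/sh
# Standard OCF return codes (normally sourced from ocf-shellfuncs).
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"
: "${OCF_RUNNING_MASTER:=8}"

# Hypothetical simulation knobs, NOT part of any real RA:
# FOOBAR_STATE is one of stopped/slave/master, FOOBAR_START_OK is yes/no.
FOOBAR_STATE=${FOOBAR_STATE:-stopped}
FOOBAR_START_OK=${FOOBAR_START_OK:-yes}
FOOBAR_NODE=${FOOBAR_NODE:-$(uname -n)}

# Hypothetical monitor: reports the simulated state with OCF codes.
foobar_monitor() {
    case "$FOOBAR_STATE" in
        master) return "$OCF_RUNNING_MASTER" ;;
        slave)  return "$OCF_SUCCESS" ;;
        *)      return "$OCF_NOT_RUNNING" ;;
    esac
}

# Hypothetical start: brings the resource up as a slave, or fails.
foobar_start() {
    [ "$FOOBAR_START_OK" = yes ] || return "$OCF_ERR_GENERIC"
    FOOBAR_STATE=slave
    return "$OCF_SUCCESS"
}

# Hypothetical master -> slave switch.
foobar_do_demote() {
    FOOBAR_STATE=slave
    return "$OCF_SUCCESS"
}

# Assumption: the notify variables are visible during demote. Returns 0
# if the transition plans to stop the resource on this node.
stop_planned_here() {
    for n in ${OCF_RESKEY_CRM_meta_notify_stop_uname:-}; do
        [ "$n" = "$FOOBAR_NODE" ] && return 0
    done
    return 1
}

foobar_demote() {
    foobar_monitor
    rc=$?
    case "$rc" in
        "$OCF_RUNNING_MASTER")
            foobar_do_demote || return "$OCF_ERR_GENERIC"
            ;;
        "$OCF_SUCCESS")
            # Already running as a slave: nothing to do.
            ;;
        "$OCF_NOT_RUNNING")
            # Unlike the dev guide, do not fail right away. If a stop
            # is planned on this node, report success without starting.
            if stop_planned_here; then
                return "$OCF_SUCCESS"
            fi
            # Otherwise silently start as a slave; fail only if the
            # start itself fails.
            foobar_start || return "$OCF_ERR_GENERIC"
            ;;
        *)
            return "$OCF_ERR_GENERIC"
            ;;
    esac
    return "$OCF_SUCCESS"
}
```

With this shape, the demote -> stop -> start -> promote transition can get
past its first step on a stopped master, and demote only reports
OCF_ERR_GENERIC when the resource really cannot be brought back as a slave.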

Comments? Advice?

Regards,
-- 
Jehan-Guillaume de Rorthais
Dalibo



