[ClusterLabs] Pacemaker tries to demote resource that isn't running and returns OCF_FAILED_MASTER
andrew at beekhof.net
Fri Aug 28 00:14:43 EDT 2015
> On 21 Aug 2015, at 1:32 pm, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> 21.08.2015 00:35, Brian Campbell wrote:
>> I have a master/slave resource (with a custom resource agent) which,
>> if it is shut down uncleanly, will return OCF_FAILED_MASTER on the next
>> "monitor" operation. This seems to be what the resource agent guide
>> suggests that exit code should be used for.
>> After the node is fenced and comes up again, Pacemaker probes all of
>> the resources. It gets the OCF_FAILED_MASTER exit code, and decides
>> that it needs to demote the resource. So it executes the demote
>> action. My resource agent returns an error on a demote action if it is
>> not running, which seems to be the suggested behavior according to the
>> resource agent guide.
>> This then causes Pacemaker to log a failure for the "demote" action,
>> and then try to recover by stopping (which succeeds cleanly because
>> the resource is stopped) followed by starting it again (which again
>> succeeds, as we can start in slave mode from a failed state). So the
>> end state is correct, but crm_mon shows a failed action that you need
>> to clear out:
>> Failed actions:
>>     (node=es-efs-master2, call=73, rc=1, status=complete,
>>     last-rc-change=Thu Aug 20 12:52:21 2015, queued=54ms, exec=1ms): unknown error
>> I'm curious about whether the behavior of my resource agent is
>> correct. Should I not be returning OCF_FAILED_MASTER upon the
>> "monitor" operation if the resource isn't started?
> Correct. If the resource is not started, it cannot be master or slave; it can become master only after Pacemaker has requested it. An unexpected master would be just as much an error.
> If you can determine that one resource instance is more suitable to become master than another, you should set the master score accordingly so that Pacemaker promotes the correct instance.
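One way to act on the master-score advice is for the agent to compute a per-node preference during its monitor action and publish it with `crm_master`, the helper that master/slave OCF agents commonly call to set their promotion score. This is a sketch under stated assumptions: the replication-lag metric, the 0 to 100 scale, and the `master_score` and `crm_master_argv` helpers are all hypothetical; only the `crm_master -l reboot -v <score>` invocation pattern is taken from existing agents.

```python
def master_score(replication_lag_bytes):
    """Hypothetical preference: the less an instance lags, the higher its score.

    Returns an integer in 0..100; each full KiB of lag costs one point.
    """
    if replication_lag_bytes < 0:
        raise ValueError("lag cannot be negative")
    return max(0, 100 - replication_lag_bytes // 1024)


def crm_master_argv(score):
    """Argument list for publishing this node's promotion score.

    "-l reboot" makes the value transient, so it is discarded when the
    node restarts and would have to be recomputed by a later monitor.
    """
    return ["crm_master", "-l", "reboot", "-v", str(score)]
```

An agent would then run something like `subprocess.run(crm_master_argv(master_score(lag)), check=True)` from its monitor action; keeping the argument list separate from the call makes the scoring logic testable without a cluster.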
>> Or should the
>> "demote" operation do something different in this state, like actually
>> starting up the slave?
> In general, if the current resource state is already what it would be after the operation completes, there is absolutely no reason to return an error; just pretend the operation succeeded.
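The rule above makes demote idempotent. A minimal sketch, assuming the agent has already classified the instance as `"stopped"`, `"slave"` or `"master"`; the `demote` function, the state strings and the `do_demote` callback are hypothetical, while the exit code values are the standard OCF ones. For a stopped instance it reports OCF_NOT_RUNNING, as Andrew's reply in this thread recommends, instead of a generic error.

```python
# Standard OCF exit codes used below.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def demote(current_state, do_demote):
    """Idempotent demote for a master/slave resource (illustrative sketch).

    current_state: "stopped", "slave" or "master", as detected by the agent.
    do_demote: callable performing the real master-to-slave switch;
               returns True on success.
    """
    if current_state == "slave":
        # Already in the state demote would produce: report success.
        return OCF_SUCCESS
    if current_state == "stopped":
        # Not running at all: report that actual state, not a generic error.
        return OCF_NOT_RUNNING
    if current_state == "master":
        return OCF_SUCCESS if do_demote() else OCF_ERR_GENERIC
    raise ValueError(f"unknown state: {current_state!r}")
```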
Always return the actual state, i.e. OCF_NOT_RUNNING in these two cases.
Only return OCF_FAILED_MASTER if you know enough to say that it's in the master state (i.e. via a lock file or similar mechanism) but not able to handle requests.
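This rule can be expressed as a monitor action that reports only what it can observe. A sketch under stated assumptions: the process and health checks and the master lock file are hypothetical mechanisms an agent might implement. A stopped instance yields OCF_NOT_RUNNING even if a stale lock file survived the crash, so a probe after fencing does not report a failed master and should not trigger the spurious demote described in this thread.

```python
import os

# Standard OCF exit codes (values from the OCF resource agent API).
OCF_SUCCESS = 0          # running (as slave, for a master/slave resource)
OCF_ERR_GENERIC = 1      # generic failure
OCF_NOT_RUNNING = 7      # cleanly stopped
OCF_RUNNING_MASTER = 8   # running and healthy as master
OCF_FAILED_MASTER = 9    # known to be master, but failing


def has_master_lock(path):
    """Hypothetical marker: a lock file created on promote, removed on demote."""
    return os.path.exists(path)


def monitor(process_running, serving_requests, master_lock_present):
    """Map observed facts to an OCF monitor exit code."""
    if not process_running:
        # Nothing is running, so it cannot be master or slave. Report the
        # actual state even if a stale master lock survived a crash.
        return OCF_NOT_RUNNING
    if master_lock_present:
        # We know we are master; the only question is whether we are healthy.
        return OCF_RUNNING_MASTER if serving_requests else OCF_FAILED_MASTER
    # Running without the master marker: a slave.
    return OCF_SUCCESS if serving_requests else OCF_ERR_GENERIC
```

Keeping the checks as plain arguments lets the decision table be tested in isolation; the agent itself would pass in its real process check, health probe and `has_master_lock(...)` result.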
>> It seems like the behavior of Pacemaker is different than what's
>> documented in the resource agent guide, so I'm trying to figure out if
>> this is a bug in my resource agent, a bug in Pacemaker, a
>> misunderstanding on my part, or actually intended behavior.
>> -- Brian
>> Users mailing list: Users at clusterlabs.org
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org