[ClusterLabs] Pacemaker tries to demote resource that isn't running and returns OCF_FAILED_MASTER

Mon Aug 31 01:21:13 UTC 2015

> On 29 Aug 2015, at 1:24 am, Brian Campbell <brian.campbell at editshare.com> wrote:
> 
> On Fri, Aug 28, 2015 at 12:14 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
>> 
>>> On 21 Aug 2015, at 1:32 pm, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>>> 
>>> 21.08.2015 00:35, Brian Campbell wrote:
>>>> I have a master/slave resource (with a custom resource agent) which,
>>>> if it uncleanly shut down, will return OCF_FAILED_MASTER on the next
>>>> "monitor" operation. This seems to be what
>>>> http://www.linux-ha.org/doc/dev-guides/_literal_ocf_failed_master_literal_9.html
>>>> suggests that exit code should be used for.
>>>> 
>>>> After the node is fenced, and comes up again, Pacemaker probes all of
>>>> the resources. It gets the OCF_FAILED_MASTER exit code, and decides
>>>> that it needs to demote the resource. So it executes the demote
>>>> action. My resource agent returns an error on a demote action if it is
>>>> not running, which seems to be the suggested behavior according to
>>>> http://www.linux-ha.org/doc/dev-guides/_literal_demote_literal_action.html
>>>> 
>>>> This then causes Pacemaker to log a failure for the "demote" action,
>>>> and then try to recover by stopping (which succeeds cleanly because
>>>> the resource is stopped) followed by starting it again (which again
>>>> succeeds, as we can start in slave mode from a failed state). So the
>>>> end state is correct, but crm_mon shows a failed action that you need
>>>> to clear out:
>>>> 
>>>> Failed actions:
>>>>    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_demote_0
>>>> (node=es-efs-master2, call=73, rc=1, status=complete,
>>>> last-rc-change=Thu Aug 20 12:52:21 2015, queued=54ms, exec=1ms): unknown error
>>>> 
>>>> I'm curious about whether the behavior of my resource agent is
>>>> correct. Should I not be returning OCF_FAILED_MASTER upon the
>>>> "monitor" operation if the resource isn't started?
>>> 
>>> Correct. If the resource is not started, it cannot be master or slave; it can become master only after Pacemaker has requested it. An unexpected master would be an error as well.
>>>
>>> If you can determine that one resource instance is more suitable to become master than another, you should set the master score accordingly so that Pacemaker promotes the correct instance.
>>> 
>>>> Or should the
>>>> "demote" operation do something different in this state, like actually
>>>> starting up the slave?
>>>> 
>>> 
>>> In general, if the current resource state is the same as it would be after the operation completed, there is absolutely no reason to return an error - just pretend the operation succeeded.
>> 
>> Always return the actual state, i.e. OCF_NOT_RUNNING in these two cases.
>>
>> Only return OCF_FAILED_MASTER if you know enough to say that it's in the master state (i.e. lock file, or similar mechanism) but not able to handle requests.
> 
> Thanks for the clarifications!
> 
> So it sounds like I should be returning OCF_NOT_RUNNING from the
> monitor operation even if I detect that it was uncleanly shut down in
> the master state earlier,

It really depends on whether you need any cleanup to happen.
Need cleanup: OCF_FAILED_MASTER
_Safely_ stopped:   OCF_NOT_RUNNING
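
As a rough illustration of the rule above (not taken from any real agent),
the monitor decision could be sketched like this in Python. The function
name and its four boolean inputs are hypothetical; a real agent would derive
them from its own lock files or status checks. The numeric values are the
standard OCF exit codes.

```python
# Standard OCF exit codes (values from the OCF resource agent API).
OCF_SUCCESS = 0          # running normally (as slave, for a master/slave resource)
OCF_ERR_GENERIC = 1      # generic failure
OCF_NOT_RUNNING = 7      # cleanly stopped
OCF_RUNNING_MASTER = 8   # running normally as master
OCF_FAILED_MASTER = 9    # in master state, but failed


def monitor_rc(running, is_master, healthy, needs_cleanup):
    """Pick a monitor exit code. Hypothetical helper; all inputs are assumptions.

    running       -- the service process is currently up
    is_master     -- the instance is (or was last known to be) in master mode
    healthy       -- a running instance is able to handle requests
    needs_cleanup -- a stopped instance left state behind that must be cleaned up
    """
    if not running:
        # Need cleanup: OCF_FAILED_MASTER; safely stopped: OCF_NOT_RUNNING.
        return OCF_FAILED_MASTER if needs_cleanup else OCF_NOT_RUNNING
    if is_master:
        return OCF_RUNNING_MASTER if healthy else OCF_FAILED_MASTER
    return OCF_SUCCESS if healthy else OCF_ERR_GENERIC


if __name__ == "__main__":
    # A node that comes back after fencing with nothing left to clean up.
    print(monitor_rc(running=False, is_master=False, healthy=False, needs_cleanup=False))
```

If monitor reports OCF_FAILED_MASTER for a stopped instance that needs
cleanup, Pacemaker will follow up with a demote, as described earlier in
the thread, so the demote action would then have to perform that cleanup
rather than fail because nothing is running.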

> and only return OCF_FAILED_MASTER if it is
> running in the master state but failed for some reason, so it needs a
> demote or stop.
> 
> -- Brian