[ClusterLabs] Pacemaker tries to demote resource that isn't running and returns OCF_FAILED_MASTER

Fri Aug 28 15:24:54 UTC 2015

On Fri, Aug 28, 2015 at 12:14 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
>
>> On 21 Aug 2015, at 1:32 pm, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>>
>> 21.08.2015 00:35, Brian Campbell wrote:
>>> I have a master/slave resource (with a custom resource agent) which,
>>> if it uncleanly shut down, will return OCF_FAILED_MASTER on the next
>>> "monitor" operation. This seems to be what
>>> http://www.linux-ha.org/doc/dev-guides/_literal_ocf_failed_master_literal_9.html
>>> suggests that exit code should be used for.
>>>
>>> After the node is fenced, and comes up again, Pacemaker probes all of
>>> the resources. It gets the OCF_FAILED_MASTER exit code, and decides
>>> that it needs to demote the resource. So it executes the demote
>>> action. My resource agent returns an error on a demote action if it is
>>> not running, which seems to be the suggested behavior according to
>>> http://www.linux-ha.org/doc/dev-guides/_literal_demote_literal_action.html
>>>
>>> This then causes Pacemaker to log a failure for the "demote" action,
>>> and then try to recover by stopping (which succeeds cleanly because
>>> the resource is stopped) followed by starting it again (which again
>>> succeeds, as we can start in slave mode from a failed state). So the
>>> end state is correct, but crm_mon shows a failed action that you need
>>> to clear out:
>>>
>>> Failed actions:
>>>     editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_demote_0
>>> (node=es-efs-master2, call=73, rc=1, status=complete,
>>> last-rc-change=Thu Aug 20 12:52:21 2015, queued=54ms, exec=1ms):
>>> unknown error
>>>
>>> I'm curious about whether the behavior of my resource agent is
>>> correct. Should I not be returning OCF_FAILED_MASTER upon the
>>> "monitor" operation if the resource isn't started?
>>
>> Correct. If the resource is not started, it cannot be master or slave; it can become master only after Pacemaker has requested it. An unexpected master would be an error in just the same way.
>>
>> If you can determine that one resource instance is more suitable to become master than another, you should set the master scores accordingly so that Pacemaker promotes the correct instance.
>>
>>> Or should the
>>> "demote" operation do something different in this state, like actually
>>> starting up the slave?
>>>
>>
>> In general, if the current resource state is the same as it would be after the operation completed, there is absolutely no reason to return an error; just pretend the operation succeeded.
>
> Always return the actual state, i.e. OCF_NOT_RUNNING in these two cases.
>
> Only return OCF_FAILED_MASTER if you know enough to say that it's in the master state (i.e. from a lock file or similar mechanism) but not able to handle requests.

Thanks for the clarifications!

So it sounds like I should be returning OCF_NOT_RUNNING from the
monitor operation even if I detect that it was uncleanly shut down in
the master state earlier, and only return OCF_FAILED_MASTER if it is
running in the master state but failed for some reason, so it needs a
demote or stop.
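
A minimal sketch of that mapping, in Python purely for illustration (a real
agent would more likely be a shell script). The exit code values are the
standard OCF ones; the function names and the `running`, `is_master`,
`healthy` and `demote` arguments are hypothetical stand-ins for whatever
checks the agent actually performs:

```python
# Standard OCF resource agent exit codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7
OCF_RUNNING_MASTER = 8
OCF_FAILED_MASTER = 9


def monitor_rc(running, is_master=False, healthy=True):
    """Map the observed state to a monitor exit code.

    The three flags are assumptions: the agent has to derive them itself
    (pid check, lock file, test request, ...).
    """
    if not running:
        # Report the actual state, even if the last shutdown was an
        # unclean one while in the master role.
        return OCF_NOT_RUNNING
    if is_master:
        return OCF_RUNNING_MASTER if healthy else OCF_FAILED_MASTER
    return OCF_SUCCESS if healthy else OCF_ERR_GENERIC


def demote_rc(running, is_master, demote):
    """Exit code for a demote request; `demote` is a hypothetical callable
    that performs the real demotion and returns True on success."""
    if not running:
        # One reading of "always return the actual state".
        return OCF_NOT_RUNNING
    if not is_master:
        # Already a slave: the state matches the requested outcome.
        return OCF_SUCCESS
    return OCF_SUCCESS if demote() else OCF_ERR_GENERIC
```

With this mapping, a probe after fencing reports OCF_NOT_RUNNING, so
Pacemaker should simply start the resource instead of attempting a demote
first.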

-- Brian