[ClusterLabs] Pacemaker tries to demote resource that isn't running and returns OCF_FAILED_MASTER

Thu Aug 20 17:35:39 EDT 2015

I have a master/slave resource (with a custom resource agent) which,
if it was shut down uncleanly, will return OCF_FAILED_MASTER on the
next "monitor" operation. This seems to be the use that
http://www.linux-ha.org/doc/dev-guides/_literal_ocf_failed_master_literal_9.html
suggests for that exit code.

After the node is fenced and comes up again, Pacemaker probes all of
the resources. It gets the OCF_FAILED_MASTER exit code from my
resource and decides that it needs to demote it, so it executes the
demote action. My resource agent returns an error on a demote action
if the resource is not running, which seems to be the suggested
behavior according to
http://www.linux-ha.org/doc/dev-guides/_literal_demote_literal_action.html
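
To make the agent behavior described above concrete, here is a minimal
Python sketch of the two actions involved. It is only an illustration:
the real agent is a separate script, and the class name AgentSketch and
its fields (running, role, unclean_shutdown) are hypothetical. The
numeric values are the standard OCF exit codes.

```python
# Standard OCF exit codes used in this sketch.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7
OCF_RUNNING_MASTER = 8
OCF_FAILED_MASTER = 9


class AgentSketch:
    """Hypothetical stand-in for the custom agent, not the real script."""

    def __init__(self, running=False, role="slave", unclean_shutdown=False):
        self.running = running                    # is the service process up?
        self.role = role                          # "master" or "slave"
        self.unclean_shutdown = unclean_shutdown  # assumed on-disk marker

    def monitor(self):
        # After an unclean shutdown the agent reports a failed master,
        # even though the process itself is not running.
        if self.unclean_shutdown:
            return OCF_FAILED_MASTER
        if not self.running:
            return OCF_NOT_RUNNING
        return OCF_RUNNING_MASTER if self.role == "master" else OCF_SUCCESS

    def demote(self):
        # Demoting a resource that is not running is treated as an error.
        if not self.running:
            return OCF_ERR_GENERIC
        self.role = "slave"
        return OCF_SUCCESS
```

With AgentSketch(running=False, unclean_shutdown=True), monitor()
returns 9 and a following demote() returns 1, which is the combination
that produces the failed action described in this message.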

This then causes Pacemaker to log a failure for the "demote" action,
and then try to recover by stopping (which succeeds cleanly because
the resource is stopped) followed by starting it again (which again
succeeds, as we can start in slave mode from a failed state). So the
end state is correct, but crm_mon shows a failed action that you need
to clear out:

Failed actions:
    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_demote_0
    (node=es-efs-master2, call=73, rc=1, status=complete, last-rc-change=Thu Aug 20 12:52:21 2015, queued=54ms, exec=1ms): unknown error
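
The sequence of operations reported above can be replayed as a small
Python sketch. This only mirrors the order of actions and return codes
observed here; it is not Pacemaker's actual scheduling logic, and the
function names and the state dictionary are assumptions made for
illustration.

```python
# Standard OCF exit codes used in this sketch.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7
OCF_FAILED_MASTER = 9


def monitor(state):
    # The agent reports a failed master after an unclean shutdown.
    if state["unclean_shutdown"]:
        return OCF_FAILED_MASTER
    return OCF_SUCCESS if state["running"] else OCF_NOT_RUNNING


def demote(state):
    # The agent refuses to demote a resource that is not running.
    return OCF_SUCCESS if state["running"] else OCF_ERR_GENERIC


def stop(state):
    # Stopping an already stopped resource succeeds cleanly.
    state["running"] = False
    return OCF_SUCCESS


def start(state):
    # The agent can start in slave mode from a failed state.
    state["running"] = True
    state["unclean_shutdown"] = False
    return OCF_SUCCESS


def replay_described_sequence():
    """Return the (action, rc) pairs in the order described in the post."""
    state = {"running": False, "unclean_shutdown": True}
    history = [("monitor", monitor(state))]
    if history[-1][1] == OCF_FAILED_MASTER:
        rc = demote(state)
        history.append(("demote", rc))
        if rc != OCF_SUCCESS:
            # Recovery as observed: stop, then start again.
            history.append(("stop", stop(state)))
            history.append(("start", start(state)))
    return history
```

Calling replay_described_sequence() yields monitor=9, demote=1, stop=0,
start=0. The rc=1 on the demote step corresponds to the "unknown error"
entry shown by crm_mon.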

I'm curious about whether the behavior of my resource agent is
correct. Should I not be returning OCF_FAILED_MASTER upon the
"monitor" operation if the resource isn't started? Or should the
"demote" operation do something different in this state, like actually
starting up the slave?

It seems like the behavior of Pacemaker is different from what's
documented in the resource agent guide, so I'm trying to figure out if
this is a bug in my resource agent, a bug in Pacemaker, a
misunderstanding on my part, or actually intended behavior.

-- Brian