[ClusterLabs] Resource not starting correctly

Ken Gaillot kgaillot at redhat.com
Mon Apr 15 18:30:42 EDT 2019


On Mon, 2019-04-15 at 14:15 -0600, JCA wrote:
> I have a simple two-node cluster, node one and node two, with a
> single resource, ClusterMyApp. The nodes are CentOS 7 VMs. The
> resource is created executing the following line in node one:
> 
>    # pcs resource create ClusterMyApp ocf:myapp:myapp-script op
> monitor interval=30s

FYI, it doesn't matter which node you change the configuration on --
it's automatically sync'd across all nodes.
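
If you want to double-check that, you can query the configuration on
either node after making a change on just one of them; the output will
be identical (just an illustration, any query of the CIB will do):

   # pcs config | grep -A 3 "Resource: ClusterMyApp"
   # cibadmin --query | grep ClusterMyApp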

> This invokes myapp-script, which I installed under
> /usr/lib/ocf/resource.d/myapp/myapp-script, both in one and two -
> i.e. it is exactly the same script in both nodes. 
> 
> On executing the command above in node one, I get the following log
> entries in node one itself:
> 
> Apr 15 13:40:12 one crmd[13670]:  notice: Result of probe operation
> for ClusterMyApp on one: 7 (not running)
> Apr 15 13:40:12 one crmd[13670]:  notice: Result of start operation
> for ClusterMyApp on one: 0 (ok)

Whenever the cluster starts on a node, or a resource is added,
pacemaker probes the state of all resources on that node (or on all
nodes, when a resource is added) by calling each agent's "monitor"
action once. You'll see this "Result of probe operation" message for
all resources on all nodes.

This allows pacemaker to detect if and where a service is already
running.
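
If it helps to picture what a probe exercises, here's a minimal sketch
of a typical "monitor" action in an OCF agent (the pid file path and
the way the process is checked are made up -- your agent will differ):

   myapp_monitor() {
       # No pid file: the service was never started, or was stopped
       # cleanly. A probe returning 7 here is the normal "stopped" case.
       if [ ! -f /var/run/myapp.pid ]; then
           return $OCF_NOT_RUNNING
       fi

       # Pid file exists but the process is gone: also "not running".
       if ! kill -0 "$(cat /var/run/myapp.pid)" 2>/dev/null; then
           return $OCF_NOT_RUNNING
       fi

       return $OCF_SUCCESS
   }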

> This is in line with what I expect from myapp-script when invoked
> with the 'start' option (which is what the command above is doing.)
> myapp-script first checks out whether my app is running, and if it is
> not then launches it. The rest of the log entries are to do with my
> app, indicating that it started without any problems.
> 
> In node two, when the command above is executed in one, the following
> log entries are generated:
> 
> Apr 15 13:40:12 two crmd[4293]:  notice: State transition S_IDLE ->
> S_POLICY_ENGINE
> Apr 15 13:40:12 two pengine[4292]:  notice:  * Start     
> ClusterMyApp     (one )
> Apr 15 13:40:12 two pengine[4292]:  notice: Calculated transition 16,
> saving inputs in /var/lib/pacemaker/pengine/pe-input-66.bz2

At any given time, one node in the cluster is elected the "Designated
Controller" (DC). That node will calculate what (if anything) needs to
be done, and tell the right nodes to do it. Above, it has determined
that ClusterMyApp needs to be started on node one.
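
You can always see which node is currently the DC in the status
output, for example:

   # crm_mon -1 | grep "Current DC"
   Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum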

> Apr 15 13:40:12 two crmd[4293]:  notice: Initiating monitor operation
> ClusterMyApp_monitor_0 locally on two
> Apr 15 13:40:12 two crmd[4293]:  notice: Initiating monitor operation
> ClusterMyApp_monitor_0 on one

The cluster first probes the current state of the service on both
nodes, before any actions have been taken. The expected result is "not
running".

> Apr 15 13:40:12 two crmd[4293]:  notice: Result of probe operation
> for ClusterMyApp on two: 7 (not running)
> Apr 15 13:40:12 two crmd[4293]:  notice: Initiating start operation
> ClusterMyApp_start_0 on one
> Apr 15 13:40:12 two crmd[4293]:  notice: Initiating monitor operation
> ClusterMyApp_monitor_30000 on one
> Apr 15 13:40:12 two crmd[4293]: warning: Action 4
> (ClusterMyApp_monitor_30000) on one failed (target: 0 vs. rc: 7):
> Error
> Apr 15 13:40:12 two crmd[4293]:  notice: Transition aborted by
> operation ClusterMyApp_monitor_30000 'create' on one: Event failed

The cluster successfully probed the service on both nodes, and started
it on node one. It then tried to start a 30-second recurring monitor
for the service, but the monitor immediately failed (the expected
result was running, but the monitor said it was not running).
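
The "target: 0 vs. rc: 7" in that warning is comparing OCF exit codes:
the cluster expected the monitor to return 0 but the agent returned 7.
For reference:

   OCF_SUCCESS=0        # "running" -- what the cluster expected (target)
   OCF_ERR_GENERIC=1    # generic failure
   OCF_NOT_RUNNING=7    # "not running" -- what the monitor returned (rc)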

> After doing all of the above, pcs status returns the following, when
> invoked in either node:
> 
> Cluster name: MyCluster
> Stack: corosync
> Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition
> with quorum
> Last updated: Mon Apr 15 13:45:14 2019
> Last change: Mon Apr 15 13:40:11 2019 by root via cibadmin on one
> 
> 2 nodes configured
> 1 resource configured
> 
> Online: [ one two ]
> 
> Full list of resources:
> 
>  ClusterMyApp	(ocf::myapp:myapp-script):	Started one
> 
> Failed Actions:
> * ClusterMyApp_monitor_30000 on one 'not running' (7): call=37,
> status=complete, exitreason='',
>     last-rc-change='Mon Apr 15 13:40:12 2019', queued=0ms, exec=105ms
> 
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
> 
> The start function in this script is as follows:
> 
> myapp_start() {
>   myapp_conf_check
>   local diagnostic=$?
> 
>   if [ $diagnostic -ne $OCF_SUCCESS ]; then
>     return $diagnostic
>   fi
> 
>   myapp_monitor
> 
>   local state=$?
> 
>   case $state in
>     $OCF_SUCCESS)
>         return $OCF_SUCCESS
>       ;;
> 
>     $OCF_NOT_RUNNING)
>       myapp_launch > /dev/null 2>&1
>       if [ $?  -eq 0 ]; then
>         return $OCF_SUCCESS
>       fi
> 
>       return $OCF_ERR_GENERIC
>       ;;
> 
>     *)
>       return $state
>       ;;
>   esac
> }
> 
> I know for a fact that, in one, myapp_launch gets invoked, and that
> its exit value is 0. The function therefore returns OCF_SUCCESS, as
> it should. However, if I understand things correctly, the log entries
> in two seem to claim that the exit value of the script in one is
> OCF_NOT_RUNNING. 

The start succeeded. It's the recurring monitor that failed.
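
I can't tell from this alone why the first recurring monitor saw the
app as stopped, but one thing worth checking: the OCF convention is
that "start" should not return until the service is far enough along
that a "monitor" call would succeed. A common way to write that is to
poll after launching (just a sketch, not your agent -- pacemaker will
time the action out if the service never comes up):

   myapp_start() {
       myapp_monitor
       if [ $? -eq $OCF_SUCCESS ]; then
           return $OCF_SUCCESS      # already running, nothing to do
       fi

       myapp_launch > /dev/null 2>&1 || return $OCF_ERR_GENERIC

       # Wait until a monitor would report success before returning.
       while ! myapp_monitor; do
           sleep 1
       done
       return $OCF_SUCCESS
   }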

> 
> What's going on here? It's obviously something to do with myapp-
> script - but, what? 
-- 
Ken Gaillot <kgaillot at redhat.com>


