[ClusterLabs] Resource not starting correctly

JCA 1.41421 at gmail.com
Mon Apr 15 20:02:06 EDT 2019


Thanks. See my comments interspersed below.

On Mon, Apr 15, 2019 at 4:30 PM Ken Gaillot <kgaillot at redhat.com> wrote:

> On Mon, 2019-04-15 at 14:15 -0600, JCA wrote:
> > I have a simple two-node cluster, node one and node two, with a
> > single resource, ClusterMyApp. The nodes are CentOS 7 VMs. The
> > resource is created executing the following line in node one:
> >
> >    # pcs resource create ClusterMyApp ocf:myapp:myapp-script op
> > monitor interval=30s
>
> FYI, it doesn't matter which node you change the configuration on --
> it's automatically sync'd across all nodes.
>
>
   OK.
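
  A quick way to convince oneself of this, using the resource name from
above: after running the create on one, querying the resource on two shows
the same definition, with nothing copied over by hand:

   # pcs resource show ClusterMyApp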

> This invokes myapp-script, which I installed under
> > /usr/lib/ocf/resource.d/myapp/myapp-script, both in one and two -
> > i.e. it is exactly the same script in both nodes.
> >
> > On executing the command above in node one, I get the following log
> > entries in node one itself:
> >
> > Apr 15 13:40:12 one crmd[13670]:  notice: Result of probe operation
> > for ClusterMyApp on one: 7 (not running)
> > Apr 15 13:40:12 one crmd[13670]:  notice: Result of start operation
> > for ClusterMyApp on one: 0 (ok)
>
> Whenever starting the cluster on a node, or adding a resource,
> pacemaker probes the state of all resources on the node (or all nodes
> in the case of adding a resource) by calling the agent's "monitor"
> action once. You'll see this "Result of probe operation" for all
> resources on all nodes.
>
> This allows pacemaker to detect if and where a service is already
> running.
>
>
   OK.
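
  So, if I follow, the probe is just my own "monitor" action being invoked
once. For reference, here is a minimal monitor along the lines I understand
the OCF convention to expect - the pgrep pattern below is only a placeholder
for however an agent actually detects its app:

myapp_monitor() {
  # Both the one-shot probe and the recurring monitor land here.
  # OCF return-code convention:
  #   OCF_SUCCESS (0)     - the app is running
  #   OCF_NOT_RUNNING (7) - the app is cleanly stopped
  #   anything else       - an error condition
  if pgrep -f myapp > /dev/null 2>&1; then
    return $OCF_SUCCESS
  fi
  return $OCF_NOT_RUNNING
}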


> > This is in line with what I expect from myapp-script when invoked
> > with the 'start' option (which is what the command above is doing.)
> > myapp-script first checks out whether my app is running, and if it is
> > not then launches it. The rest of the log entries are to do with my
> > app, indicating that it started without any problems.
> >
> > In node two, when the command above is executed in one, the following
> > log entries are generated:
> >
> > Apr 15 13:40:12 two crmd[4293]:  notice: State transition S_IDLE ->
> > S_POLICY_ENGINE
> > Apr 15 13:40:12 two pengine[4292]:  notice:  * Start
> > ClusterMyApp     (one )
> > Apr 15 13:40:12 two pengine[4292]:  notice: Calculated transition 16,
> > saving inputs in /var/lib/pacemaker/pengine/pe-input-66.bz2
>
> At any given time, one node in the cluster is elected the "Designated
> Controller" (DC). That node will calculate what (if anything) needs to
> be done, and tell the right nodes to do it. Above, it has determined
> that ClusterMyApp needs to be started on node one.
>

  OK.
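
  Good to know. If I understand the crm_simulate man page correctly, the
pe-input file mentioned in the log above can be replayed later to see what
the DC decided, along the lines of:

   # crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-66.bz2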


>
> > Apr 15 13:40:12 two crmd[4293]:  notice: Initiating monitor operation
> > ClusterMyApp_monitor_0 locally on two
> > Apr 15 13:40:12 two crmd[4293]:  notice: Initiating monitor operation
> > ClusterMyApp_monitor_0 on one
>
> The cluster first probes the current state of the service on both
> nodes, before any actions have been taken. The expected result is "not
> running".
>

   OK.


>
> > Apr 15 13:40:12 two crmd[4293]:  notice: Result of probe operation
> > for ClusterMyApp on two: 7 (not running)
> > Apr 15 13:40:12 two crmd[4293]:  notice: Initiating start operation
> > ClusterMyApp_start_0 on one
> > Apr 15 13:40:12 two crmd[4293]:  notice: Initiating monitor operation
> > ClusterMyApp_monitor_30000 on one
> > Apr 15 13:40:12 two crmd[4293]: warning: Action 4
> > (ClusterMyApp_monitor_30000) on one failed (target: 0 vs. rc: 7):
> > Error
> > Apr 15 13:40:12 two crmd[4293]:  notice: Transition aborted by
> > operation ClusterMyApp_monitor_30000 'create' on one: Event failed
>
> The cluster successfully probed the service on both nodes, and started
> it on node one. It then tried to start a 30-second recurring monitor
> for the service, but the monitor immediately failed (the expected
> result was running, but the monitor said it was not running).
>

    It failed where? On one, I know for a fact that my app is running, as
reported by ps. I also know it started correctly and is sitting there
waiting for work to do - it depends on timers and external events. On two,
of course, it is not running.
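
  One thing I can still try is exercising the agent outside the cluster
with ocf-tester (from the resource-agents package), along the lines of:

   # ocf-tester -n ClusterMyApp /usr/lib/ocf/resource.d/myapp/myapp-script

to see whether the monitor action also misbehaves on one when driven by
hand.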


>
> > After doing all of the above, pcs status returns the following, when
> > invoked in either node:
> >
> > Cluster name: MyCluster
> > Stack: corosync
> > Current DC: two (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition
> > with quorum
> > Last updated: Mon Apr 15 13:45:14 2019
> > Last change: Mon Apr 15 13:40:11 2019 by root via cibadmin on one
> >
> > 2 nodes configured
> > 1 resource configured
> >
> > Online: [ one two ]
> >
> > Full list of resources:
> >
> >  ClusterMyApp (ocf::myapp:myapp-script):      Started one
> >
> > Failed Actions:
> > * ClusterMyApp_monitor_30000 on one 'not running' (7): call=37,
> > status=complete, exitreason='',
> >     last-rc-change='Mon Apr 15 13:40:12 2019', queued=0ms, exec=105ms
> >
> >
> > Daemon Status:
> >   corosync: active/disabled
> >   pacemaker: active/disabled
> >   pcsd: active/enabled
> >
> > The start function in this script is as follows:
> >
> > myapp_start() {
> >   myapp_conf_check
> >   local diagnostic=$?
> >
> >   if [ $diagnostic -ne $OCF_SUCCESS ]; then
> >     return $diagnostic
> >   fi
> >
> >   myapp_monitor
> >
> >   local state=$?
> >
> >   case $state in
> >     $OCF_SUCCESS)
> >       return $OCF_SUCCESS
> >       ;;
> >
> >     $OCF_NOT_RUNNING)
> >       myapp_launch > /dev/null 2>&1
> >       if [ $? -eq 0 ]; then
> >         return $OCF_SUCCESS
> >       fi
> >
> >       return $OCF_ERR_GENERIC
> >       ;;
> >
> >     *)
> >       return $state
> >       ;;
> >   esac
> > }
> >
> > I know for a fact that, in one, myapp_launch gets invoked, and that
> > its exit value is 0. The function therefore returns OCF_SUCCESS, as
> > it should. However, if I understand things correctly, the log entries
> > in two seem to claim that the exit value of the script in one is
> > OCF_NOT_RUNNING.
>
> The start succeeded. It's the recurring monitor that failed.
>

  I assume that the recurring monitor invokes the myapp_monitor function
that I created at the same time as myapp_start. Well, as far as I can tell,
the problem is that, as I mentioned in later posts, myapp_start gets
invoked several times when creating the resource. As a result - although I
have yet to understand it in detail - myapp_monitor occasionally fails
somewhere in that sequence. Whether the resource starts correctly or not
depends on what happens the last time myapp_monitor is invoked during
resource creation - which is why the whole thing only works every so often.

  Why is myapp_start invoked so many times when creating the resource?
Who/what is controlling this? Is it configurable?
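
  In the meantime, to pin down exactly which actions get invoked and when,
I am tracing calls by logging the requested action at the top of the script
- a crude sketch, with /tmp/myapp-script.log picked arbitrarily:

# First thing in the agent, before dispatching on $1:
echo "$(date '+%F %T') pid=$$ action=$1" >> /tmp/myapp-script.log

That should show whether the repeated invocations are really 'start', or
probes and recurring monitors being dispatched to 'monitor'.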


> >
> > What's going on here? It's obviously something to do with myapp-
> > script - but, what?
> --
> Ken Gaillot <kgaillot at redhat.com>