[ClusterLabs] Resource not starting correctly IV

Tue Apr 16 12:21:02 EDT 2019

Thanks to everybody who has contributed to this. Let me summarize things,
if only for my own benefit - I learn more quickly when I try to explain to
others what I am trying to learn.

I instrumented my script in order to find out exactly how many times it is
invoked when creating my resource, and exactly what functions in the script
are invoked. Just as a reminder, the logs I am about to describe are
created directly as a result of executing the following command:

# pcs resource create ClusterMyApp ocf:myapp:myapp-script op monitor interval=30s

myapp-script is always the same, and the starting conditions for the app
that it is meant to launch are always exactly the same. In all cases before
issuing the command above I made sure to delete the resource, if already
there.

What follows is a log of the way in which myapp-script was invoked as a
result of executing the command above. It consists of a series of blocks,
like the following:

  monitor:

    Status: NOT_RUNNING
    Exit: NOT_RUNNING

This block is an invocation of myapp-script with argument 'monitor'. The
'Status' line means myapp_monitor was invoked, and it returned
OCF_NOT_RUNNING. The 'Exit' line means that myapp-script exited with
OCF_NOT_RUNNING. In a block with more than two lines, the line immediately
preceding the 'Exit' line represents the function that the script invoked
for the argument it was given. The other lines are nested function calls
made as a consequence of that invocation.
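
For anybody who wants to reproduce this kind of trace, here is a minimal
sketch of helpers that would write blocks in the format described above. It
is not the code from myapp-script: the names MYAPP_TRACE_LOG, rc_name,
trace_header and trace_result, and the default log path, are all made up for
illustration. The only facts relied on are the standard OCF return code
values 0 (OCF_SUCCESS) and 7 (OCF_NOT_RUNNING).

```shell
# Hypothetical trace file; override MYAPP_TRACE_LOG to write elsewhere.
MYAPP_TRACE_LOG="${MYAPP_TRACE_LOG:-/tmp/myapp-script.trace}"

# Translate the OCF return codes that appear in this post into names.
# Any other code is printed numerically rather than guessed at.
rc_name() {
    case "$1" in
        0) echo SUCCESS ;;
        7) echo NOT_RUNNING ;;
        *) echo "rc=$1" ;;
    esac
}

# Open a block: the action name, a colon, and a blank line.
trace_header() {
    printf '%s:\n\n' "$1" >> "$MYAPP_TRACE_LOG"
}

# Append one indented "Label: RESULT" line to the current block.
trace_result() {
    printf '    %s: %s\n' "$1" "$(rc_name "$2")" >> "$MYAPP_TRACE_LOG"
}
```

The script would call trace_header once with its action argument, then
trace_result after each function returns (for example "trace_result Status
$rc") and once more with the label Exit just before exiting.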

A typical log obtained on node one would be the following:

monitor:

    Status: NOT_RUNNING
    Exit: NOT_RUNNING

start:

    Validate: SUCCESS
    Status: NOT_RUNNING
    Start: SUCCESS
    Exit: SUCCESS

monitor:

    Status: NOT_RUNNING
    Exit: NOT_RUNNING

stop:

    Validate: SUCCESS
    Status: SUCCESS
    Stop: SUCCESS
    Exit: SUCCESS

start:

    Validate: SUCCESS
    Status: NOT_RUNNING
    Start: SUCCESS
    Exit: SUCCESS

monitor:

    Status: SUCCESS
    Exit: SUCCESS

A few observations:

1. The monitor/start/stop sequence above can be repeated many times, and
the number of times it is repeated varies from one run to the next.
Occasionally, just three calls are made: monitor, start and monitor, with
the final monitor exiting with SUCCESS.

2. It would seem that what Pacemaker is doing is the following:
   a. Check whether the app is running.
   b. If it is not, launch it.
   c. Check again.
   d. If running, exit.
   e. Otherwise, stop it.
   f. Launch it.
   g. Go to a.

3. On node two, the log obtained as a consequence of creating the resource
always seems to be

monitor:

    Status: NOT_RUNNING
    Exit: NOT_RUNNING

which makes sense to me.

4. If the above is correct, and if I am getting the picture correctly, it
would seem that the problem is that my monitoring function does not detect
immediately that my app is up and running. That's clearly my problem.
However, is there any way to get Pacemaker to introduce a delay between
steps b and c in observation 2 above?

5. Following up on 4: if my script sleeps for a few seconds immediately
after launching my app (it's a daemon) in myapp_start then everything works
fine. Indeed, the call sequence on node one now becomes:

monitor:

    Status: NOT_RUNNING
    Exit: NOT_RUNNING

start:

    Validate: SUCCESS
    Status: NOT_RUNNING
    Start: SUCCESS
    Exit: SUCCESS

monitor:

    Status: SUCCESS
    Exit: SUCCESS
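
As a variation on the fixed sleep in observation 5, the start function could
instead poll the monitor function until it reports success, so that start
only returns once the daemon is detectably up. The following is only a
sketch of that idea, not what myapp-script actually does: wait_until_running
is a hypothetical helper, and the one-second polling interval and the retry
count are arbitrary choices. The return values 0 and 1 are the standard
values of OCF_SUCCESS and OCF_ERR_GENERIC.

```shell
# wait_until_running MAX_TRIES CHECK_COMMAND [ARGS...]
# Hypothetical helper: runs CHECK_COMMAND up to MAX_TRIES times, pausing
# one second after each failed attempt.
wait_until_running() {
    tries="$1"
    shift
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0    # OCF_SUCCESS: the check reported the app as running
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1            # OCF_ERR_GENERIC: the app never came up
}
```

Inside myapp_start, right after launching the daemon, this would be used as
"wait_until_running 10 myapp_monitor || return $OCF_ERR_GENERIC". Compared
with a fixed sleep, the wait ends as soon as the monitor succeeds, and a
daemon that never comes up is reported as a failed start rather than as a
successful one.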