[ClusterLabs] Resource not starting correctly IV

Jan Pokorný jpokorny at redhat.com
Tue Apr 16 15:46:09 EDT 2019


[letter-casing wise:
 it's either "Pacemaker" or down-to-the-terminal "pacemaker"]

On 16/04/19 10:21 -0600, JCA wrote:
> 2. It would seem that what Pacemaker is doing is the following:
>    a. Check out whether the app is running.
>    b. If it is not, launch it.
>    c. Check out again
>    d. If running, exit.
>    e. Otherwise, stop it.
>    f. Launch it.
>    g. Go to a.
> 
> [...]
> 
> 4. If the above is correct, and if I am getting the picture correctly, it
> would seem that the problem is that my monitoring function does not detect
> immediately that my app is up and running. That's clearly my problem.
> However, is there any way to get Pacemaker to introduce a delay between
> steps b and c in section 2 above?

Ah, it should have occurred to me!

Typical solution, I think, is to have a sleep loop following the
daemon launch within the "start" action that runs (a subset of) what
"monitor" normally does, so as to synchronize on the "service ready"
moment.  The default timeout for "start" within the agent's metadata
should then reflect the usual time it takes to reach the point where
"monitor" is happy, plus some reserve.
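
A minimal sketch of that loop follows, in Python for brevity (agents
are more commonly shell scripts, but the logic carries over).  The
names start, launch and monitor are placeholders of mine, not part of
any Pacemaker API; only the numeric exit codes follow the OCF
convention.

```python
import time

# OCF exit codes as used by resource agents.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def start(launch, monitor, timeout_s=20.0, interval_s=0.5):
    """Launch the daemon, then poll the monitor check until it is happy.

    `launch` and `monitor` are hypothetical callables supplied by the
    agent: `launch` starts the daemon, `monitor` returns an OCF code.
    """
    if monitor() == OCF_SUCCESS:
        return OCF_SUCCESS              # already running, nothing to do
    launch()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if monitor() == OCF_SUCCESS:
            return OCF_SUCCESS          # service is ready, "start" may return
        time.sleep(interval_s)
    return OCF_ERR_GENERIC              # never became ready in time
```

The point is that "start" returns success only once the same check
"monitor" relies on has passed, so the first real "monitor" call
cannot race with the daemon's startup.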

Some agents may do more elaborate things, like precisely limiting such
waiting with respect to the time they were actually given by the
resource manager/pacemaker (if I don't misremember, that value is
provided through environment variables for sort of an introspection).
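
A rough sketch of such a time budget, again in Python.  It assumes
the operation timeout arrives in the OCF_RESKEY_CRM_meta_timeout
environment variable, expressed in milliseconds; that is my
recollection and should be verified against your Pacemaker version.
The function name, the fallback value and the reserve are made up
for illustration.

```python
import os


def start_budget_seconds(environ=os.environ, default_ms=20000, reserve_s=2.0):
    """Seconds "start" may spend waiting for readiness.

    Assumption: the resource manager exports the operation timeout as
    OCF_RESKEY_CRM_meta_timeout, in milliseconds.  A reserve is
    subtracted so the agent gives up before the manager kills it.
    """
    raw = environ.get("OCF_RESKEY_CRM_meta_timeout", "")
    try:
        timeout_ms = int(raw)
    except ValueError:
        timeout_ms = default_ms         # unset or malformed: fall back
    return max(0.0, timeout_ms / 1000.0 - reserve_s)
```

The returned value would serve as the deadline of the polling loop in
"start", in place of a hard-coded number of seconds.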

Resource agent experts could advise here.

(Truth be told, "daemon readiness" used to be a very marginalized
problem, putting barriers to practical [= race-free] dependency ordering
etc.  Luckily, clever people realized that the most precise tracking
can only be at the hands of the actual daemon implementors if the
event-driven paradigm is to be applied.  For instance, if you can
influence my_app, and it's a standard forking daemon, it would be best
if the parent exited only when the daemon is truly ready to provide
service -- this usually requires some, typically signal-based,
synchronization amongst the daemon processes.  With systemd, the
situation is much simpler since no forking is necessary, just a call to
sd_notify(3) -- in that case, though, your agent would need to mimic
the server side of the sd_notify protocol since nothing would do it
for you.)
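
For illustration, mimicking the server side could look roughly like
the Python sketch below: bind a Unix datagram socket, hand its path
to the daemon via NOTIFY_SOCKET, and wait for a datagram carrying a
"READY=1" line.  The function name and the temporary socket location
are my own choices, and the sketch ignores other notification fields
as well as abstract-namespace sockets.

```python
import os
import socket
import subprocess
import tempfile
import time


def launch_and_wait_ready(argv, timeout_s=20.0):
    """Start argv with NOTIFY_SOCKET set; return (process, ready).

    Hypothetical helper: `ready` is True once the daemon has sent a
    datagram containing a READY=1 line, False if the timeout expired.
    """
    with tempfile.TemporaryDirectory(prefix="notify-") as sock_dir, \
            socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock_path = os.path.join(sock_dir, "notify.sock")
        sock.bind(sock_path)            # bind first so no message is lost
        env = dict(os.environ, NOTIFY_SOCKET=sock_path)
        proc = subprocess.Popen(argv, env=env)
        deadline = time.monotonic() + timeout_s
        ready = False
        while not ready:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                data = sock.recv(4096)
            except socket.timeout:
                break
            # A datagram holds newline-separated VARIABLE=value pairs.
            ready = b"READY=1" in data.split(b"\n")
        return proc, ready
```

An agent's "start" would report success only when ready is True, and
otherwise clean up the process and return an error.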

> 5. Following up on 4: if my script sleeps for a few seconds immediately
> after launching my app (it's a daemon) in myapp_start then everything works
> fine. Indeed, the call sequence in node one now becomes:
> 
>          monitor:
> 
>     Status: NOT_RUNNING
>     Exit: NOT_RUNNING
> 
>           start:
> 
>     Validate: SUCCESS
>     Status: NOT_RUNNING
>     Start: SUCCESS
>     Exit: SUCCESS
> 
>           monitor:
> 
>     Status: SUCCESS
>     Exit: SUCCESS

That's easier but less effective and reliable (more opportunistic than
fact-based) than polling the "monitor" outcomes privately within "start"
as sketched above.

-- 
Jan (Poki)