[ClusterLabs] Antw: [EXT] Re: Q: starting systemd resources

Ken Gaillot kgaillot at redhat.com
Fri Feb 5 10:47:07 EST 2021


On Fri, 2021-02-05 at 12:19 +0100, Ulrich Windl wrote:
> Hi Reid!
> 
> Thanks for the explanation. Picking the right log levels is not
> trivial, but I'd expect them to be somewhat consistent:
> If for non-systemd I see separate start and finished messages, I'd
> also expect them for systemd resources, or vice-versa.
> Otherwise it will make reading the logs harder, specifically when
> several resources are starting concurrently.
> 
> Regards,
> Ulrich
> 
> > > > Reid Wahl <nwahl at redhat.com> schrieb am 05.02.2021 um 11:08 in
> > > > Nachricht
> 
> <CAPiuu99SzKMsrxG9dw4y5+ZOFcNG3EAHpJP=zFUbcTibVzuDLw at mail.gmail.com>:
> > Hi, Ulrich. I presume you're talking about the log_finished line,
> > which
> > looks like this in pacemaker.log:
> > 
> > Feb 05 01:48:47.192 fastvm-rhel-8-0-23 pacemaker-execd     [15446]
> > (log_finished at execd_commands.c:214)  info: dummy start (call 23,
> > PID 18743)
> > exited with status 0 (execution time 11ms, queue time 0ms)
> > 
> > Is that correct?
> 
> Yes. The detail is visible in pacemaker.log only.
> 
> > 
> > I got curious and took a look. It looks like this comes down to the
> > action_complete() function:
> >   -
> > https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.5/daemons/execd/ 
> > execd_commands.c#L896-L1050
> > 
> > This function calls cmd_finalize() at the end, which calls a chain
> > of
> > functions ending in log_finished(). log_finished() logs the "start
> > ...
> > exited with status" line shown above.
> > 
> > However, systemd resources have to be handled a bit differently. We
> > have to
> > schedule a monitor operation to run after the start operation, in
> > order to
> > check whether the start was truly a success.
> 
> Doesn't systemctl return a proper exit status?

It does, but we don't use systemctl, we use the systemd C library
interface. And unfortunately, our current implementation does not use
it optimally.

The current design is based on initiating the start, then polling the
the status periodically until it reports active.

A better design would be to initiate the start, then listen for systemd
DBus signals that report when the start is complete. However that would
be a significant redesign, and developer time is not available for it.

> > ~~~
> >     if (pcmk__str_eq(rclass, PCMK_RESOURCE_CLASS_SYSTEMD,
> > pcmk__str_casei))
> > {
> >         if ((cmd->exec_rc == PCMK_OCF_OK)
> >             && pcmk__strcase_any_of(cmd->action, "start", "stop",
> > NULL)) {
> >             /* systemd returns from start and stop actions after
> > the action
> >              * begins, not after it completes. We have to jump
> > through a few
> >              * hoops so that we don't report 'complete' to the rest
> > of
> > pacemaker
> >              * until it's actually done.
> >              */
> >             goagain = true;
> >             cmd->real_action = cmd->action;
> >             cmd->action = strdup("monitor");
> > ...
> >     if (goagain) {
> >     ...
> >             schedule_lrmd_cmd(rsc, cmd);
> > 
> >             /* Don't finalize cmd, we're not done with it yet */
> >             return;
> > ~~~
> > 
> > So for the start operation, it never reaches the cmd_finalize()
> > call at the
> > end, until the follow-up monitor runs. The follow-up monitor
> > operation does
> > end up calling cmd_finalize() at the end of action_complete. But
> > the
> > "log_finished" message is logged at debug level for monitor
> > operations. So
> > you won't see it unless debugging is enabled.
> > 
> > Does this make sense?
> 
> Yes.
> 
> > 
> > Example:
> > ~~~
> > Feb 05 02:06:51.123 fastvm-rhel-8-0-23 pacemaker-execd     [19354]
> > (log_execute)        info: executing - rsc:nfs-daemon action:start
> > call_id:20
> > Feb 05 02:06:51.123 fastvm-rhel-8-0-23 pacemaker-execd     [19354]
> > (systemd_unit_exec)  debug: Performing asynchronous start op on
> > systemd
> > unit nfs-server.service named 'nfs-daemon'
> > Feb 05 02:06:51.124 fastvm-rhel-8-0-23 pacemaker-execd     [19354]
> > (systemd_unit_exec_with_unit)        debug: Calling StartUnit for
> > nfs-daemon: /org/freedesktop/systemd1/unit/nfs_2dserver_2eservice
> > Feb 05 02:06:51.517 fastvm-rhel-8-0-23 pacemaker-execd     [19354]
> > (action_complete)    debug: nfs-daemon start may still be in
> > progress:
> > re-scheduling (elapsed=394ms, remaining=99606ms,
> > start_delay=2000ms)
> > Feb 05 02:06:53.518 fastvm-rhel-8-0-23 pacemaker-execd     [19354]
> > (log_execute)        debug: executing - rsc:nfs-daemon
> > action:monitor
> > call_id:20
> > Feb 05 02:06:53.518 fastvm-rhel-8-0-23 pacemaker-execd     [19354]
> > (systemd_unit_exec)  debug: Performing asynchronous status op on
> > systemd
> > unit nfs-server.service named 'nfs-daemon'
> > Feb 05 02:06:53.521 fastvm-rhel-8-0-23 pacemaker-execd     [19354]
> > (action_complete)    debug: nfs-daemon systemd start is now
> > complete
> > (elapsed=2397ms, remaining=97603ms): ok (0)
> > Feb 05 02:06:53.521 fastvm-rhel-8-0-23 pacemaker-execd     [19354]
> > (log_finished)       debug: nfs-daemon monitor (call 20) exited
> > with status
> > 0 (execution time 2397ms, queue time 0ms)
> > ~~~
> > 
> > 
> > 
> > On Thu, Feb 4, 2021 at 11:25 PM Ulrich Windl <
> > Ulrich.Windl at rz.uni-regensburg.de> wrote:
> > 
> > > Hi!
> > > 
> > > While analyzing cluster problems I noticed this:
> > > Normal resources executed via OCF RAs create two log entries by
> > > pacemaker-execd: One when starting the resource and another when
> > > the
> > > resource completed starting.
> > > However for systemd units I only get a start message. Is that
> > > intentional?
> > > Does that mean systemd starts are asynchronous in general (i.e.:
> > > The
> > > process returns before start in complete)?
> > > (Still I get a completed message from pacemaker-controld)
> > > 
> > > Example:
> > > Feb 04 15:41:25 h19 pacemaker-execd[7793]:  notice: executing -
> > > rsc:prm_virtlockd action:start call_id:95
> > > Feb 04 15:41:27 h19 pacemaker-execd[7793]:  notice: executing -
> > > rsc:prm_libvirtd action:start call_id:97
> > > 
> > > So one could guess that lirtlockd and libvirtd are staring
> > > concurrently,
> > > but the did not because of this sequence:
> > > Feb 04 15:41:27 h19 pacemaker-controld[7796]:  notice: Result of
> > > start
> > > operation for prm_virtlockd on h19: ok
> > > Feb 04 15:41:27 h19 pacemaker-execd[7793]:  notice: executing -
> > > rsc:prm_libvirtd action:start call_id:97
> > > 
> > > Regards,
> > > Ulrich
> > > 
> > > 
> > > _______________________________________________
> > > Manage your subscription:
> > > https://lists.clusterlabs.org/mailman/listinfo/users 
> > > 
> > > ClusterLabs home: https://www.clusterlabs.org/ 
> > > 
> > > 
> > 
> > -- 
> > Regards,
> > 
> > Reid Wahl, RHCA
> > Senior Software Maintenance Engineer, Red Hat
> > CEE - Platform Support Delivery - ClusterHA
> 
> 
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 
-- 
Ken Gaillot <kgaillot at redhat.com>



More information about the Users mailing list