[ClusterLabs] Batch start (approx > 5) of systemd resources fail (exitreason='inactive')

Klaus Wenninger kwenning at redhat.com
Thu Aug 29 03:34:02 UTC 2024


On Wed, Aug 28, 2024 at 8:04 PM Ken Gaillot <kgaillot at redhat.com> wrote:

> Thanks for the report, and sorry for the slow response.
>
> There is a longstanding goal to improve systemd resource monitoring by
> using DBus signals instead of polling the status:
>
>  https://projects.clusterlabs.org/T25
>
> There's a good chance that would avoid the need for the 2-second
> polling at start and stop as well, which would take care of the most
> significant problem here.
>
> I've created a task for avoiding systemd reloads when unnecessary:
>
>  https://projects.clusterlabs.org/T870
>
> but that's unlikely to happen, since the executor creates the overrides
> as each start or stop happens, and the executor has no knowledge of
> what other starts or stops might be planned. It would be an intrusive
> change to get that information where it's needed.
>

And yes - doing a reload only on every 10th occasion is not a good solution.
We had that and I removed it years ago ;-)

Klaus

>
> On Fri, 2024-07-05 at 12:27 +0200, Borja Macho wrote:
> > Hi everyone,
> >
> > I have been experiencing this issue with Pacemaker running on Debian
> > 12 (pacemaker 2.1.5-1+deb12u1). This is the information I have found
> > so far:
> >
> > The reload operation takes around 0.5s to finish:
> > ```
> > root@m1:~# time systemctl daemon-reload
> >
> > real    0m0.487s
> > user    0m0.003s
> > sys     0m0.009s
> > ```
> >
> > Before starting a systemd resource, Pacemaker performs a daemon
> > reload (I guess because of the override it adds to the unit file).
> > When more than one systemd resource is started at the same time,
> > these reloads stack one on top of the other, delaying the actual
> > start of the resources.
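> >
> > A sketch of what that override might look like (the path, file name
> > and exact contents here are my assumption, inferred from the
> > "Cluster Controlled" description that shows up in the log below):
> > ```
> > root@m1:~# cat /run/systemd/system/wait_5_to_start@10.service.d/50-pacemaker.conf
> > [Unit]
> > Description=Cluster Controlled wait_5_to_start@10
> > Before=pacemaker.service
> > ```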
> >
> > In the log you can see 9 reloads (one per resource, wait5_2 to
> > wait5_10, which start concurrently after wait5_1), each spaced by
> > ~0.5s:
> > ```
> > 2024-07-05T08:00:15.422759 m1 pacemaker-controld[3088077]:  notice:
> > Result of start operation for wait5_1 on m1: ok
> > 2024-07-05T08:00:15.426589 m1 pacemaker-controld[3088077]:  notice:
> > Requesting local execution of monitor operation for wait5_1 on m1
> > 2024-07-05T08:00:15.427135 m1 pacemaker-controld[3088077]:  notice:
> > Requesting local execution of start operation for wait5_2 on m1
> > 2024-07-05T08:00:15.427581 m1 pacemaker-controld[3088077]:  notice:
> > Requesting local execution of start operation for wait5_3 on m1
> > 2024-07-05T08:00:15.427978 m1 pacemaker-controld[3088077]:  notice:
> > Requesting local execution of start operation for wait5_4 on m1
> > 2024-07-05T08:00:15.428375 m1 pacemaker-controld[3088077]:  notice:
> > Requesting local execution of start operation for wait5_5 on m1
> > 2024-07-05T08:00:15.428904 m1 pacemaker-controld[3088077]:  notice:
> > Requesting local execution of start operation for wait5_6 on m1
> > 2024-07-05T08:00:15.429259 m1 pacemaker-controld[3088077]:  notice:
> > Requesting local execution of start operation for wait5_7 on m1
> > 2024-07-05T08:00:15.429764 m1 pacemaker-controld[3088077]:  notice:
> > Result of monitor operation for wait5_1 on m1: ok
> > 2024-07-05T08:00:15.430060 m1 pacemaker-controld[3088077]:  notice:
> > Requesting local execution of start operation for wait5_8 on m1
> > 2024-07-05T08:00:15.430501 m1 pacemaker-controld[3088077]:  notice:
> > Requesting local execution of start operation for wait5_9 on m1
> > 2024-07-05T08:00:15.430999 m1 pacemaker-controld[3088077]:  notice:
> > Requesting local execution of start operation for wait5_10 on m1
> > 2024-07-05T08:00:15.431668 m1 systemd[1]: Reloading.
> > 2024-07-05T08:00:15.845709 m1 systemd[1]: Reloading.
> > 2024-07-05T08:00:16.333209 m1 systemd[1]: Reloading.
> > 2024-07-05T08:00:16.857070 m1 systemd[1]: Reloading.
> > 2024-07-05T08:00:17.330217 m1 systemd[1]: Reloading.
> > 2024-07-05T08:00:17.859113 m1 systemd[1]: Reloading.
> > 2024-07-05T08:00:18.315644 m1 systemd[1]: Reloading.
> > 2024-07-05T08:00:18.749311 m1 systemd[1]: Reloading.
> > 2024-07-05T08:00:19.243338 m1 systemd[1]: Reloading.
> > 2024-07-05T08:00:19.835406 m1 systemd[1]: Starting
> > wait_5_to_start@10.service - Cluster Controlled wait_5_to_start@10...
> > ```
> >
> > That means roughly N times the reload time passes before the actual
> > start of the last systemd unit, where N is the number of systemd
> > resources started concurrently.
> >
> > Having a look at the code, I saw a hardcoded maximum of 2 seconds
> > (execd_commands.c -> action_complete() -> delay = QB_MIN(2000,
> > delay);) between sending the start message over D-Bus and performing
> > the first status check. The problem lies in the fact that, by the
> > time that first check runs, some of the systemd services have not
> > actually been started yet, so they still report their previous
> > 'inactive' state rather than 'activating'.
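> >
> > (Illustrative only: the state the executor sees at that moment can
> > also be queried by hand. A Type=notify unit goes inactive ->
> > activating -> active, so a query issued while the start job is still
> > queued behind the pending reloads still reports 'inactive'. The unit
> > name below is just one from my test.)
> > ```
> > root@m1:~# systemctl show -p ActiveState,SubState wait_5_to_start@2.service
> > ```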
> >
> > This delay can be seen in the re-scheduling of the monitor in the
> > start operation:
> >
> > First resource started (failure):
> > ```
> > Jul 05 08:02:40.975 m1 pacemaker-execd     [3088074]
> > (process_unit_method_reply)    debug: DBus request for start of
> > wait5_2 using /org/freedesktop/systemd1/job/639194 succeeded
> > Jul 05 08:02:40.975 m1 pacemaker-execd     [3088074]
> > (action_complete)              debug: wait5_2 start may still be in
> > progress: re-scheduling (elapsed=487ms, remaining=99513ms,
> > start_delay=2000ms)
> > ```
> >
> > Last resource started (success):
> > ```
> > Jul 05 08:02:44.747 m1 pacemaker-execd     [3088074]
> > (process_unit_method_reply)    debug: DBus request for start of
> > wait5_10 using /org/freedesktop/systemd1/job/640170 succeeded
> > Jul 05 08:02:44.747 m1 pacemaker-execd     [3088074]
> > (action_complete)              debug: wait5_10 start may still be in
> > progress: re-scheduling (elapsed=4259ms, remaining=95741ms,
> > start_delay=2000ms)
> > ```
> >
> > Thus any number of concurrent systemd resource starts greater than
> > ceil(2/reload_time) is prone to failure.
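> > With the ~0.487s reload measured above, that threshold is
> > ceil(2 / 0.487) = 5, which matches the "approx > 5" in the subject
> > line.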
> >
> > Some extra information:
> > Resources that reached the 'activating' state before those 2 seconds
> > ran out started successfully, as they reported 'activating' when the
> > first monitor was performed:
> > ```
> > Jul 05 08:02:44.827 m1 pacemaker-execd     [3088074] (log_execute)
> >                debug: executing - rsc:wait5_6 action:monitor
> > call_id:249
> > Jul 05 08:02:44.827 m1 pacemaker-execd     [3088074]
> > (services__execute_systemd)    debug: Performing asynchronous status
> > op on systemd unit wait_5_to_start@6 for resource wait5_6
> > Jul 05 08:02:44.831 m1 pacemaker-execd     [3088074]
> > (action_complete)               info: wait5_6 monitor is still in
> > progress: re-scheduling (elapsed=4342ms, remaining=95658ms,
> > start_delay=2000ms)
> > ```
> >
> > The systemd service used for those tests:
> > ```
> > root@m1:~# systemctl cat wait_5_to_start@.service
> > # /etc/systemd/system/wait_5_to_start@.service
> > [Unit]
> > Description=notify start after 5 seconds service %i
> >
> > [Service]
> > Type=notify
> > ExecStart=/usr/bin/python3 -c 'import time; import systemd.daemon;
> > time.sleep(5); systemd.daemon.notify("READY=1"); time.sleep(86400)'
> > ```
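> > (With Type=notify, the unit stays in the 'activating' state for
> > those 5 seconds until READY=1 is sent, and only then becomes
> > 'active'; a status check in that window therefore sees 'activating'
> > rather than 'inactive'.)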
> > How the resources were created (and tested):
> > ```
> > # for I in $(seq 1 10); do pcs resource create wait5_$I
> > systemd:wait_5_to_start@$I op monitor interval="60s" timeout="100s"
> > op start interval="0s" timeout="100s" op stop interval="0s"
> > timeout="100s" --disabled; done
> > # for I in $(seq 2 10); do pcs constraint colocation add wait5_$I
> > with wait5_1 INFINITY; done
> > # for I in $(seq 2 10); do pcs constraint order start wait5_1 then
> > start wait5_$I kind=Mandatory; done
> > # for I in $(seq 2 10); do pcs resource enable wait5_$I; done
> > # pcs resource move wait5_1
> > ```
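> >
> > After the move, the failed start operations can be seen in the
> > cluster status (standard status view, shown here as a pointer rather
> > than as output from this run); the exit reason is the 'inactive' one
> > from the subject line:
> > ```
> > # crm_mon -1
> > ```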
> >
> > Best regards,
> > Borja.
> >
> --
> Ken Gaillot <kgaillot at redhat.com>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>
>