<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Aug 28, 2024 at 8:04 PM Ken Gaillot <<a href="mailto:kgaillot@redhat.com">kgaillot@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Thanks for the report, and sorry for the slow response.<br>

<br>

There is a longstanding goal to improve systemd resource monitoring by<br>

using DBus signals instead of polling the status:<br>

<br>

 <a href="https://projects.clusterlabs.org/T25" rel="noreferrer" target="_blank">https://projects.clusterlabs.org/T25</a><br>

<br>

There's a good chance that would avoid the need for the 2-second<br>

polling at start and stop as well, which would take care of the most<br>

significant problem here.<br>

<br>

I've created a task for avoiding systemd reloads when unnecessary:<br>

<br>

 <a href="https://projects.clusterlabs.org/T870" rel="noreferrer" target="_blank">https://projects.clusterlabs.org/T870</a><br>

<br>

but that's unlikely to happen, since the executor creates the overrides<br>

as each start or stop happens, and the executor has no knowledge of<br>

what other starts or stops might be planned. It would be an intrusive<br>

change to get that information where it's needed.<br></blockquote><div><br></div><div>And yes - doing a reload just on every 10th occasion is not a good solution.</div><div>We had that and I removed it years ago ;-)</div><div><br></div><div>Klaus</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

On Fri, 2024-07-05 at 12:27 +0200, Borja Macho wrote:<br>

> Hi everyone,<br>

> <br>

> I have been experiencing this issue with Pacemaker running in Debian<br>

> 12 (pacemaker 2.1.5-1+deb12u1) this is the information I found so<br>

> far:<br>

> <br>

> The reload operation takes around 0.5s to finish:<br>

> ```<br>

> root@m1:~# time systemctl daemon-reload<br>

> <br>

> real    0m0.487s<br>

> user    0m0.003s<br>

> sys     0m0.009s<br>

> ```<br>

> <br>

> Before starting the systemd resources pacemaker performs a reload (I<br>

> guess that due to the override it adds to the unit file),<br>

> when starting at the same time more than one systemd resource these<br>

> reloads stack one on top of each other, delaying the actual start of<br>

> the resources.<br>

> <br>

> In the log you can see 9 reloads (for 9 resources, wait5_2 to<br>

> wait5_10 as they start concurrently after wait5_1) each spaced by<br>

> ~0.5s: <br>

> ```<br>

> 2024-07-05T08:00:15.422759 m1 pacemaker-controld[3088077]:  notice:<br>

> Result of start operation for wait5_1 on m1: ok <br>

> 2024-07-05T08:00:15.426589 m1 pacemaker-controld[3088077]:  notice:<br>

> Requesting local execution of monitor operation for wait5_1 on m1 <br>

> 2024-07-05T08:00:15.427135 m1 pacemaker-controld[3088077]:  notice:<br>

> Requesting local execution of start operation for wait5_2 on m1 <br>

> 2024-07-05T08:00:15.427581 m1 pacemaker-controld[3088077]:  notice:<br>

> Requesting local execution of start operation for wait5_3 on m1 <br>

> 2024-07-05T08:00:15.427978 m1 pacemaker-controld[3088077]:  notice:<br>

> Requesting local execution of start operation for wait5_4 on m1 <br>

> 2024-07-05T08:00:15.428375 m1 pacemaker-controld[3088077]:  notice:<br>

> Requesting local execution of start operation for wait5_5 on m1 <br>

> 2024-07-05T08:00:15.428904 m1 pacemaker-controld[3088077]:  notice:<br>

> Requesting local execution of start operation for wait5_6 on m1 <br>

> 2024-07-05T08:00:15.429259 m1 pacemaker-controld[3088077]:  notice:<br>

> Requesting local execution of start operation for wait5_7 on m1 <br>

> 2024-07-05T08:00:15.429764 m1 pacemaker-controld[3088077]:  notice:<br>

> Result of monitor operation for wait5_1 on m1: ok <br>

> 2024-07-05T08:00:15.430060 m1 pacemaker-controld[3088077]:  notice:<br>

> Requesting local execution of start operation for wait5_8 on m1 <br>

> 2024-07-05T08:00:15.430501 m1 pacemaker-controld[3088077]:  notice:<br>

> Requesting local execution of start operation for wait5_9 on m1 <br>

> 2024-07-05T08:00:15.430999 m1 pacemaker-controld[3088077]:  notice:<br>

> Requesting local execution of start operation for wait5_10 on m1 <br>

> 2024-07-05T08:00:15.431668 m1 systemd[1]: Reloading.<br>

> 2024-07-05T08:00:15.845709 m1 systemd[1]: Reloading.<br>

> 2024-07-05T08:00:16.333209 m1 systemd[1]: Reloading.<br>

> 2024-07-05T08:00:16.857070 m1 systemd[1]: Reloading.<br>

> 2024-07-05T08:00:17.330217 m1 systemd[1]: Reloading.<br>

> 2024-07-05T08:00:17.859113 m1 systemd[1]: Reloading.<br>

> 2024-07-05T08:00:18.315644 m1 systemd[1]: Reloading.<br>

> 2024-07-05T08:00:18.749311 m1 systemd[1]: Reloading.<br>

> 2024-07-05T08:00:19.243338 m1 systemd[1]: Reloading.<br>

> 2024-07-05T08:00:19.835406 m1 systemd[1]: Starting <br>

> wait_5_to_start@10.service - Cluster Controlled wait_5_to_start@10...<br>

> ```<br>

> <br>

> That means around N times the reload time before the actual start of<br>

> the systemd unit, being N the number of concurrent systemd resources<br>

> started.<br>

> <br>

> Having a look at the code I saw a "hardcoded" maximum of 2 seconds<br>

> (execd_commands.c -> action_complete() -> delay = QB_MIN(2000,<br>

> delay);) between sending the start message to D-Bus and performing<br>

> the first check, the problem lays in that when that first check is<br>

> performed some of the systemd services have not yet been started,<br>

> thus reporting the 'inactive' previous state, as is not yet<br>

> activating.<br>

> <br>

> This delay can be seen in the re-scheduling of the monitor in the<br>

> start operation:<br>

> <br>

> First resource started (failure):<br>

> ```<br>

> Jul 05 08:02:40.975 m1 pacemaker-execd     [3088074]<br>

> (process_unit_method_reply)    debug: DBus request for start of<br>

> wait5_2 using /org/freedesktop/systemd1/job/639194 succeeded<br>

> Jul 05 08:02:40.975 m1 pacemaker-execd     [3088074]<br>

> (action_complete)              debug: wait5_2 start may still be in<br>

> progress: re-scheduling (elapsed=487ms, remaining=99513ms,<br>

> start_delay=2000ms)<br>

> ```<br>

> <br>

> Last resource started (success):<br>

> ```<br>

> Jul 05 08:02:44.747 m1 pacemaker-execd     [3088074]<br>

> (process_unit_method_reply)    debug: DBus request for start of<br>

> wait5_10 using /org/freedesktop/systemd1/job/640170 succeeded<br>

> Jul 05 08:02:44.747 m1 pacemaker-execd     [3088074]<br>

> (action_complete)              debug: wait5_10 start may still be in<br>

> progress: re-scheduling (elapsed=4259ms, remaining=95741ms,<br>

> start_delay=2000ms)<br>

> ```<br>

> <br>

> Thus any number of concurrent systemd resource starts greater than<br>

> ceil(2/reload_time) is prone to failure.<br>

> <br>

> Some extra information:<br>

> Resources that reached the activating status before those 2 seconds<br>

> ran out succeeded to start as they reported 'activating' when the<br>

> first monitor was performed:<br>

> ```<br>

> Jul 05 08:02:44.827 m1 pacemaker-execd     [3088074] (log_execute)  <br>

>                debug: executing - rsc:wait5_6 action:monitor<br>

> call_id:249<br>

> Jul 05 08:02:44.827 m1 pacemaker-execd     [3088074]<br>

> (services__execute_systemd)    debug: Performing asynchronous status<br>

> op on systemd unit wait_5_to_start@6 for resource wait5_6<br>

> Jul 05 08:02:44.831 m1 pacemaker-execd     [3088074]<br>

> (action_complete)               info: wait5_6 monitor is still in<br>

> progress: re-scheduling (elapsed=4342ms, remaining=95658ms,<br>

> start_delay=2000ms)<br>

> ```<br>

> <br>

> The systemd service used for those tests:<br>

> ```<br>

> root@m1:~# systemctl cat wait_5_to_start@.service<br>

> # /etc/systemd/system/wait_5_to_start@.service<br>

> [Unit]<br>

> Description=notify start after 5 seconds service %i<br>

> <br>

> [Service]<br>

> Type=notify<br>

> ExecStart=/usr/bin/python3 -c 'import time; import systemd.daemon;<br>

> time.sleep(5); systemd.daemon.notify("READY=1"); time.sleep(86400)'<br>

> ```<br>

> How the resources were created (and tested):<br>

> ```<br>

> # for I in $(seq 1 10); do pcs resource create wait5_$I<br>

> systemd:wait_5_to_start@$I op monitor interval="60s" timeout="100s"<br>

> op start interval="0s" timeout="100s" op stop interval="0s"<br>

> timeout="100s" --disabled; done<br>

> # for I in $(seq 2 10); do pcs constraint colocation add wait5_$I<br>

> with wait5_1 INFINITY; done<br>

> # for I in $(seq 2 10); do pcs constraint order start wait5_1 then<br>

> start wait5_$I kind=Mandatory; done<br>

> # for I in $(seq 2 10); do pcs resource enable wait5_$I; done<br>

> # pcs resource move wait5_1<br>

> ```<br>

> <br>

> Best regards,<br>

> Borja.<br>

> <br>

-- <br>

Ken Gaillot <<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>><br>

<br>

_______________________________________________<br>

Manage your subscription:<br>

<a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>

<br>

ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a><br>

<br>

</blockquote></div></div>