<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Aug 28, 2024 at 8:04 PM Ken Gaillot <<a href="mailto:kgaillot@redhat.com">kgaillot@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Thanks for the report, and sorry for the slow response.<br>
<br>
There is a longstanding goal to improve systemd resource monitoring by<br>
using DBus signals instead of polling the status:<br>
<br>
<a href="https://projects.clusterlabs.org/T25" rel="noreferrer" target="_blank">https://projects.clusterlabs.org/T25</a><br>
<br>
There's a good chance that would avoid the need for the 2-second<br>
polling at start and stop as well, which would take care of the most<br>
significant problem here.<br>
<br>
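For illustration, signal-based monitoring could look roughly like this (a<br>
minimal dbus-python sketch under GLib, not the planned implementation; the<br>
unit name is just taken from the test case below):<br>
<br>
```<br>
# Watch a unit's ActiveState via D-Bus signals instead of polling it.<br>
import dbus<br>
from dbus.mainloop.glib import DBusGMainLoop<br>
from gi.repository import GLib<br>
<br>
DBusGMainLoop(set_as_default=True)<br>
bus = dbus.SystemBus()<br>
manager = dbus.Interface(<br>
    bus.get_object('org.freedesktop.systemd1', '/org/freedesktop/systemd1'),<br>
    'org.freedesktop.systemd1.Manager')<br>
manager.Subscribe()  # systemd only emits unit signals after Subscribe()<br>
unit = bus.get_object('org.freedesktop.systemd1',<br>
                      manager.LoadUnit('wait_5_to_start@1.service'))<br>
<br>
def on_changed(interface, changed, invalidated):<br>
    if 'ActiveState' in changed:<br>
        print('ActiveState ->', changed['ActiveState'])<br>
<br>
dbus.Interface(unit, 'org.freedesktop.DBus.Properties').connect_to_signal(<br>
    'PropertiesChanged', on_changed)<br>
GLib.MainLoop().run()<br>
```<br>
<br>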
I've created a task for avoiding systemd reloads when unnecessary:<br>
<br>
<a href="https://projects.clusterlabs.org/T870" rel="noreferrer" target="_blank">https://projects.clusterlabs.org/T870</a><br>
<br>
but that's unlikely to happen, since the executor creates the overrides<br>
as each start or stop happens, and the executor has no knowledge of<br>
what other starts or stops might be planned. It would be an intrusive<br>
change to get that information where it's needed.<br></blockquote><div><br></div><div>And yes - doing a reload only on every 10th occasion is not a good solution.</div><div>We had that and I removed it years ago ;-)</div><div><br></div><div>Klaus</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
On Fri, 2024-07-05 at 12:27 +0200, Borja Macho wrote:<br>
> Hi everyone,<br>
> <br>
> I have been experiencing this issue with Pacemaker on Debian 12<br>
> (pacemaker 2.1.5-1+deb12u1); this is the information I have found so<br>
> far:<br>
> <br>
> The reload operation takes around 0.5s to finish:<br>
> ```<br>
> root@m1:~# time systemctl daemon-reload<br>
> <br>
> real 0m0.487s<br>
> user 0m0.003s<br>
> sys 0m0.009s<br>
> ```<br>
> <br>
> Before starting a systemd resource, Pacemaker performs a daemon reload<br>
> (I guess due to the override it adds to the unit file; see the example<br>
> below). When more than one systemd resource is started at the same<br>
> time, these reloads stack one on top of the other, delaying the actual<br>
> start of the resources.<br>
> <br>
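> For reference, the override added before each start appears to look<br>
> like the following; the path and exact contents here are my assumption<br>
> from a quick read of the code and from the "Cluster Controlled"<br>
> description in the log below, so they may vary between versions:<br>
> ```<br>
> # /run/systemd/system/wait_5_to_start@10.service.d/50-pacemaker.conf<br>
> [Unit]<br>
> Description=Cluster Controlled wait_5_to_start@10<br>
> Before=pacemaker.service<br>
> ```<br>
> <br>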
> In the log you can see 9 reloads (one per resource, wait5_2 through<br>
> wait5_10, which start concurrently after wait5_1), each spaced by<br>
> ~0.5s:<br>
> ```<br>
> 2024-07-05T08:00:15.422759 m1 pacemaker-controld[3088077]: notice:<br>
> Result of start operation for wait5_1 on m1: ok <br>
> 2024-07-05T08:00:15.426589 m1 pacemaker-controld[3088077]: notice:<br>
> Requesting local execution of monitor operation for wait5_1 on m1 <br>
> 2024-07-05T08:00:15.427135 m1 pacemaker-controld[3088077]: notice:<br>
> Requesting local execution of start operation for wait5_2 on m1 <br>
> 2024-07-05T08:00:15.427581 m1 pacemaker-controld[3088077]: notice:<br>
> Requesting local execution of start operation for wait5_3 on m1 <br>
> 2024-07-05T08:00:15.427978 m1 pacemaker-controld[3088077]: notice:<br>
> Requesting local execution of start operation for wait5_4 on m1 <br>
> 2024-07-05T08:00:15.428375 m1 pacemaker-controld[3088077]: notice:<br>
> Requesting local execution of start operation for wait5_5 on m1 <br>
> 2024-07-05T08:00:15.428904 m1 pacemaker-controld[3088077]: notice:<br>
> Requesting local execution of start operation for wait5_6 on m1 <br>
> 2024-07-05T08:00:15.429259 m1 pacemaker-controld[3088077]: notice:<br>
> Requesting local execution of start operation for wait5_7 on m1 <br>
> 2024-07-05T08:00:15.429764 m1 pacemaker-controld[3088077]: notice:<br>
> Result of monitor operation for wait5_1 on m1: ok <br>
> 2024-07-05T08:00:15.430060 m1 pacemaker-controld[3088077]: notice:<br>
> Requesting local execution of start operation for wait5_8 on m1 <br>
> 2024-07-05T08:00:15.430501 m1 pacemaker-controld[3088077]: notice:<br>
> Requesting local execution of start operation for wait5_9 on m1 <br>
> 2024-07-05T08:00:15.430999 m1 pacemaker-controld[3088077]: notice:<br>
> Requesting local execution of start operation for wait5_10 on m1 <br>
> 2024-07-05T08:00:15.431668 m1 systemd[1]: Reloading.<br>
> 2024-07-05T08:00:15.845709 m1 systemd[1]: Reloading.<br>
> 2024-07-05T08:00:16.333209 m1 systemd[1]: Reloading.<br>
> 2024-07-05T08:00:16.857070 m1 systemd[1]: Reloading.<br>
> 2024-07-05T08:00:17.330217 m1 systemd[1]: Reloading.<br>
> 2024-07-05T08:00:17.859113 m1 systemd[1]: Reloading.<br>
> 2024-07-05T08:00:18.315644 m1 systemd[1]: Reloading.<br>
> 2024-07-05T08:00:18.749311 m1 systemd[1]: Reloading.<br>
> 2024-07-05T08:00:19.243338 m1 systemd[1]: Reloading.<br>
> 2024-07-05T08:00:19.835406 m1 systemd[1]: Starting <br>
> wait_5_to_start@10.service - Cluster Controlled wait_5_to_start@10...<br>
> ```<br>
> <br>
> That means a delay of roughly N times the reload time before the<br>
> actual start of the systemd unit, where N is the number of systemd<br>
> resources started concurrently (here 9 × ~0.5s ≈ 4.4s, the gap between<br>
> the first reload at 08:00:15.43 and the unit start at 08:00:19.84).<br>
> <br>
> Having a look at the code, I saw a hardcoded maximum of 2 seconds<br>
> (execd_commands.c -> action_complete() -> delay = QB_MIN(2000,<br>
> delay);) between sending the start message to D-Bus and performing the<br>
> first status check. The problem is that when that first check is<br>
> performed, some of the systemd services have not been started yet, so<br>
> they still report the previous 'inactive' state rather than<br>
> 'activating' (a rough model of this follows the log excerpts below).<br>
> <br>
> This delay can be seen in the re-scheduling of the status check within<br>
> the start operation:<br>
> <br>
> First resource started (failure):<br>
> ```<br>
> Jul 05 08:02:40.975 m1 pacemaker-execd [3088074]<br>
> (process_unit_method_reply) debug: DBus request for start of<br>
> wait5_2 using /org/freedesktop/systemd1/job/639194 succeeded<br>
> Jul 05 08:02:40.975 m1 pacemaker-execd [3088074]<br>
> (action_complete) debug: wait5_2 start may still be in<br>
> progress: re-scheduling (elapsed=487ms, remaining=99513ms,<br>
> start_delay=2000ms)<br>
> ```<br>
> <br>
> Last resource started (success):<br>
> ```<br>
> Jul 05 08:02:44.747 m1 pacemaker-execd [3088074]<br>
> (process_unit_method_reply) debug: DBus request for start of<br>
> wait5_10 using /org/freedesktop/systemd1/job/640170 succeeded<br>
> Jul 05 08:02:44.747 m1 pacemaker-execd [3088074]<br>
> (action_complete) debug: wait5_10 start may still be in<br>
> progress: re-scheduling (elapsed=4259ms, remaining=95741ms,<br>
> start_delay=2000ms)<br>
> ```<br>
> <br>
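> To make the race concrete, here is a rough model (assuming the reloads<br>
> are serialized and each takes the ~0.5s measured above; this is an<br>
> illustration, not code from pacemaker):<br>
> ```<br>
> # The i-th unit only leaves 'inactive' after i stacked reloads have<br>
> # finished, but its first status check fires at most 2s after its own<br>
> # start request (the QB_MIN(2000, delay) cap mentioned above).<br>
> RELOAD_S = 0.5<br>
> FIRST_CHECK_S = 2.0<br>
> for i in range(1, 11):<br>
>     activating_at = i * RELOAD_S<br>
>     state = 'activating' if activating_at <= FIRST_CHECK_S else 'inactive'<br>
>     print(f"resource {i}: first check sees '{state}'")<br>
> ```<br>
> <br>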
> Thus any number of concurrent systemd resource starts greater than<br>
> ceil(2s / reload_time) is prone to failure (with the ~0.5s reload<br>
> measured above, anything beyond 4 concurrent starts).<br>
> <br>
> Some extra information:<br>
> Resources that reached the 'activating' state before those 2 seconds<br>
> ran out started successfully, as they reported 'activating' when the<br>
> first check was performed:<br>
> ```<br>
> Jul 05 08:02:44.827 m1 pacemaker-execd [3088074] (log_execute) <br>
> debug: executing - rsc:wait5_6 action:monitor<br>
> call_id:249<br>
> Jul 05 08:02:44.827 m1 pacemaker-execd [3088074]<br>
> (services__execute_systemd) debug: Performing asynchronous status<br>
> op on systemd unit wait_5_to_start@6 for resource wait5_6<br>
> Jul 05 08:02:44.831 m1 pacemaker-execd [3088074]<br>
> (action_complete) info: wait5_6 monitor is still in<br>
> progress: re-scheduling (elapsed=4342ms, remaining=95658ms,<br>
> start_delay=2000ms)<br>
> ```<br>
> <br>
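> For completeness, the state the first check sees can also be queried<br>
> directly over D-Bus (equivalent to 'systemctl show -p ActiveState<br>
> wait_5_to_start@6.service'; again a minimal dbus-python sketch):<br>
> ```<br>
> import dbus<br>
> <br>
> bus = dbus.SystemBus()<br>
> manager = dbus.Interface(<br>
>     bus.get_object('org.freedesktop.systemd1', '/org/freedesktop/systemd1'),<br>
>     'org.freedesktop.systemd1.Manager')<br>
> unit = bus.get_object('org.freedesktop.systemd1',<br>
>                       manager.LoadUnit('wait_5_to_start@6.service'))<br>
> props = dbus.Interface(unit, 'org.freedesktop.DBus.Properties')<br>
> # 'inactive' = not yet started, 'activating' = start in progress,<br>
> # 'active' = start finished<br>
> print(props.Get('org.freedesktop.systemd1.Unit', 'ActiveState'))<br>
> ```<br>
> <br>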
> The systemd service used for those tests:<br>
> ```<br>
> root@m1:~# systemctl cat wait_5_to_start@.service<br>
> # /etc/systemd/system/wait_5_to_start@.service<br>
> [Unit]<br>
> Description=notify start after 5 seconds service %i<br>
> <br>
> [Service]<br>
> Type=notify<br>
> ExecStart=/usr/bin/python3 -c 'import time; import systemd.daemon;<br>
> time.sleep(5); systemd.daemon.notify("READY=1"); time.sleep(86400)'<br>
> ```<br>
> How the resources were created (and tested):<br>
> ```<br>
> # for I in $(seq 1 10); do pcs resource create wait5_$I<br>
> systemd:wait_5_to_start@$I op monitor interval="60s" timeout="100s"<br>
> op start interval="0s" timeout="100s" op stop interval="0s"<br>
> timeout="100s" --disabled; done<br>
> # for I in $(seq 2 10); do pcs constraint colocation add wait5_$I<br>
> with wait5_1 INFINITY; done<br>
> # for I in $(seq 2 10); do pcs constraint order start wait5_1 then<br>
> start wait5_$I kind=Mandatory; done<br>
> # for I in $(seq 2 10); do pcs resource enable wait5_$I; done<br>
> # pcs resource move wait5_1<br>
> ```<br>
> <br>
> Best regards,<br>
> Borja.<br>
> <br>
-- <br>
Ken Gaillot <<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>><br>
<br>
_______________________________________________<br>
Manage your subscription:<br>
<a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>
<br>
ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a><br>
<br>
</blockquote></div></div>