[ClusterLabs] Batch start (approx > 5) of systemd resources fail (exitreason='inactive')

Borja Macho borja.macho2@gmail.com
Fri Jul 5 10:27:55 UTC 2024


Hi everyone,

I have been experiencing this issue with Pacemaker running on Debian 12
(pacemaker 2.1.5-1+deb12u1). This is the information I have found so far:

The reload operation takes around 0.5s to finish:
```
root@m1:~# time systemctl daemon-reload

real    0m0.487s
user    0m0.003s
sys     0m0.009s
```

Before starting a systemd resource, Pacemaker performs a daemon-reload (I
guess because of the override it adds to the unit file). When more than one
systemd resource is started at the same time, these reloads stack one on
top of another, delaying the actual start of the resources.
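
This stacking can be reproduced outside Pacemaker. A minimal sketch,
assuming bash (systemd processes the queued reload requests one at a time,
so the wall-clock time grows roughly linearly with the number of requests):
```
# Fire 9 daemon-reloads in parallel and wait for all of them; with a ~0.5s
# reload this takes ~4.5s instead of ~0.5s:
root@m1:~# time bash -c 'for I in $(seq 1 9); do systemctl daemon-reload & done; wait'
```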

In the log you can see 9 reloads (one per resource: wait5_2 to wait5_10,
which start concurrently after wait5_1), each spaced by ~0.5s:
```
2024-07-05T08:00:15.422759 m1 pacemaker-controld[3088077]:  notice: Result
of start operation for wait5_1 on m1: ok
2024-07-05T08:00:15.426589 m1 pacemaker-controld[3088077]:  notice:
Requesting local execution of monitor operation for wait5_1 on m1
2024-07-05T08:00:15.427135 m1 pacemaker-controld[3088077]:  notice:
Requesting local execution of start operation for wait5_2 on m1
2024-07-05T08:00:15.427581 m1 pacemaker-controld[3088077]:  notice:
Requesting local execution of start operation for wait5_3 on m1
2024-07-05T08:00:15.427978 m1 pacemaker-controld[3088077]:  notice:
Requesting local execution of start operation for wait5_4 on m1
2024-07-05T08:00:15.428375 m1 pacemaker-controld[3088077]:  notice:
Requesting local execution of start operation for wait5_5 on m1
2024-07-05T08:00:15.428904 m1 pacemaker-controld[3088077]:  notice:
Requesting local execution of start operation for wait5_6 on m1
2024-07-05T08:00:15.429259 m1 pacemaker-controld[3088077]:  notice:
Requesting local execution of start operation for wait5_7 on m1
2024-07-05T08:00:15.429764 m1 pacemaker-controld[3088077]:  notice: Result
of monitor operation for wait5_1 on m1: ok
2024-07-05T08:00:15.430060 m1 pacemaker-controld[3088077]:  notice:
Requesting local execution of start operation for wait5_8 on m1
2024-07-05T08:00:15.430501 m1 pacemaker-controld[3088077]:  notice:
Requesting local execution of start operation for wait5_9 on m1
2024-07-05T08:00:15.430999 m1 pacemaker-controld[3088077]:  notice:
Requesting local execution of start operation for wait5_10 on m1
2024-07-05T08:00:15.431668 m1 systemd[1]: Reloading.
2024-07-05T08:00:15.845709 m1 systemd[1]: Reloading.
2024-07-05T08:00:16.333209 m1 systemd[1]: Reloading.
2024-07-05T08:00:16.857070 m1 systemd[1]: Reloading.
2024-07-05T08:00:17.330217 m1 systemd[1]: Reloading.
2024-07-05T08:00:17.859113 m1 systemd[1]: Reloading.
2024-07-05T08:00:18.315644 m1 systemd[1]: Reloading.
2024-07-05T08:00:18.749311 m1 systemd[1]: Reloading.
2024-07-05T08:00:19.243338 m1 systemd[1]: Reloading.
2024-07-05T08:00:19.835406 m1 systemd[1]: Starting
wait_5_to_start@10.service - Cluster Controlled wait_5_to_start@10...
```

That means a delay of roughly N times the reload time before the actual
start of each systemd unit, where N is the number of systemd resources
started concurrently. In the log above, the last unit only begins starting
~4.4s after the start operations were requested (08:00:15.43 to
08:00:19.84), i.e. 9 reloads at ~0.5s each.

Having a look at the code, I saw a hardcoded maximum of 2 seconds
(execd_commands.c -> action_complete() -> delay = QB_MIN(2000, delay);)
between sending the start message over D-Bus and performing the first
status check. The problem is that when that first check is performed, some
of the systemd services have not been started yet, so they still report
their previous 'inactive' state rather than 'activating'.
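
For reference, the state that check sees can be queried by hand (Pacemaker
reads the equivalent ActiveState property over D-Bus); while a unit's start
job is still queued behind the pending reloads, the unit keeps reporting
its previous state. An illustrative query against one of the test
resources defined below:
```
root@m1:~# systemctl show -p ActiveState wait_5_to_start@2.service
ActiveState=inactive
```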

This delay can be seen in the re-scheduling of the monitor in the start
operation:

First resource started (failure):
```
Jul 05 08:02:40.975 m1 pacemaker-execd     [3088074]
(process_unit_method_reply)    debug: DBus request for start of wait5_2
using /org/freedesktop/systemd1/job/639194 succeeded
Jul 05 08:02:40.975 m1 pacemaker-execd     [3088074] (action_complete)
         debug: wait5_2 start may still be in progress: re-scheduling
(elapsed=487ms, remaining=99513ms, start_delay=2000ms)
```

Last resource started (success):
```
Jul 05 08:02:44.747 m1 pacemaker-execd     [3088074]
(process_unit_method_reply)    debug: DBus request for start of wait5_10
using /org/freedesktop/systemd1/job/640170 succeeded
Jul 05 08:02:44.747 m1 pacemaker-execd     [3088074] (action_complete)
         debug: wait5_10 start may still be in progress: re-scheduling
(elapsed=4259ms, remaining=95741ms, start_delay=2000ms)
```

Thus any number of concurrent systemd resource starts greater than
ceil(2s / reload_time) is prone to failure. With the ~0.5s reload measured
above that threshold is 4, so batches of 5 or more resources are at risk,
matching the 'approx > 5' in the subject.

Some extra information:
Resources that reached the 'activating' state before those 2 seconds ran
out started successfully, as they reported 'activating' when the first
check was performed:
```
Jul 05 08:02:44.827 m1 pacemaker-execd     [3088074] (log_execute)
         debug: executing - rsc:wait5_6 action:monitor call_id:249
Jul 05 08:02:44.827 m1 pacemaker-execd     [3088074]
(services__execute_systemd)    debug: Performing asynchronous status op on
systemd unit wait_5_to_start at 6 for resource wait5_6
Jul 05 08:02:44.831 m1 pacemaker-execd     [3088074] (action_complete)
          info: wait5_6 monitor is still in progress: re-scheduling
(elapsed=4342ms, remaining=95658ms, start_delay=2000ms)
```

The systemd service used for those tests:
```
root@m1:~# systemctl cat wait_5_to_start@.service
# /etc/systemd/system/wait_5_to_start@.service
[Unit]
Description=notify start after 5 seconds service %i

[Service]
Type=notify
ExecStart=/usr/bin/python3 -c 'import time; import systemd.daemon; time.sleep(5); systemd.daemon.notify("READY=1"); time.sleep(86400)'
```
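
When started on its own, this unit sits in 'activating' during the
5-second sleep and only becomes 'active' once READY=1 is sent, which is
exactly the state the first check needs to observe for the start to be
treated as still in progress. A quick manual check, assuming the template
unit is installed as above:
```
root@m1:~# systemctl start --no-block wait_5_to_start@1.service
root@m1:~# systemctl is-active wait_5_to_start@1.service
activating
root@m1:~# sleep 6 && systemctl is-active wait_5_to_start@1.service
active
```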
How the resources were created (and tested):
```
# for I in $(seq 1 10); do pcs resource create wait5_$I systemd:wait_5_to_start@$I \
    op monitor interval="60s" timeout="100s" \
    op start interval="0s" timeout="100s" \
    op stop interval="0s" timeout="100s" --disabled; done
# for I in $(seq 2 10); do pcs constraint colocation add wait5_$I with wait5_1 INFINITY; done
# for I in $(seq 2 10); do pcs constraint order start wait5_1 then start wait5_$I kind=Mandatory; done
# for I in $(seq 2 10); do pcs resource enable wait5_$I; done
# pcs resource move wait5_1
```

Best regards,
Borja.