[ClusterLabs] Antw: [EXT] Re: Stop timeout=INFINITY not working

Wed Jan 27 10:41:50 EST 2021

On Wed, 2021-01-27 at 08:29 +0100, Ulrich Windl wrote:
> Ken Gaillot <kgaillot at redhat.com> wrote on 26.01.2021 at 16:08 in
> message
> <ec010c29d38846eb5f50dc627fd43a1510189f4c.camel at redhat.com>:
> > On Tue, 2021-01-26 at 02:12 -0500, Digimer wrote:
> > > Hi all,
> > > 
> > >   I created a resource with an INFINITE stop timeout;
> > > 
> > > pcs resource create srv01-test ocf:alteeve:server \
> > >   name="srv01-test" \
> > >   meta allow-migrate="true" target-role="stopped" \
> > >   op monitor interval="60" \
> > >   start timeout="INFINITY" on-fail="block" \
> > >   stop timeout="INFINITY" on-fail="block" \
> > >   migrate_to timeout="INFINITY"
> > 
> > I hadn't noticed this before, but it looks like INFINITY is not
> > allowed in time interval specifications, and there's no log warning
> > about it. :-/
> 
> Hi!
> 
> I was wondering why someone would set a timeout to something like a
> day or more: to give the operator a chance to investigate and fix
> problems before the cluster tries recovery?
> 
> Regards,
> Ulrich

I can't imagine a use for it actually.

The main problem is that if the command hangs, no further transitions
will be possible until it times out. And if it has been hanging for even
an hour, it's unlikely to ever complete, so it just freezes the cluster.

Also, even if the operator fixes it, the command that was already
initiated will probably still time out, though I suppose that depends
on how the agent works.

I'm guessing the main reason for doing it would be to essentially
disable timeouts during a testing phase. But a reasonably high number
(30 minutes? whatever's appropriate to the situation) would be better
than 49 days :)

> > Time interval specifications can be an integer number of seconds,
> > an ISO 8601 duration, or a number with units (s/m/h/etc.).
> > 
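
To illustrate the accepted forms, here is a minimal Python sketch that
converts the integer-seconds and number-with-units cases to
milliseconds. It is not Pacemaker's parser: the function name
parse_interval_ms and the unit table are hypothetical, only s/m/h are
handled, and ISO 8601 durations are left out. Anything that is not
<number>[unit], including INFINITY, is rejected with an error.

```python
import re

# Hypothetical unit table (milliseconds per unit). Only the s/m/h units
# mentioned above are covered; ISO 8601 durations are not handled.
_UNIT_MS = {"": 1000, "s": 1000, "m": 60_000, "h": 3_600_000}


def parse_interval_ms(spec: str) -> int:
    """Convert a spec such as "60", "30s", "30m" or "2h" to milliseconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([smh]?)\s*", spec)
    if match is None:
        # Covers "INFINITY" and anything else that is not <number>[unit].
        raise ValueError(f"not a valid time interval: {spec!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_MS[unit]


print(parse_interval_ms("60"))   # a bare integer is taken as seconds
print(parse_interval_ms("30m"))  # 30 minutes in milliseconds
```
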
> > Timeouts are stored in milliseconds as 32-bit unsigned integers, so
> > the limit is a bit under 50 days (though I'd keep it well below
> > that).
> > 
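
For reference, the arithmetic behind that limit, as a short Python
sketch. The only input is the 32-bit unsigned millisecond storage
mentioned above; the constant names are illustrative.

```python
# Largest value a 32-bit unsigned integer can hold, read as milliseconds.
MAX_TIMEOUT_MS = 2**32 - 1

MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 ms in one day

max_days = MAX_TIMEOUT_MS / MS_PER_DAY
print(f"{max_days:.2f} days")  # 49.71 days
```

By comparison, a 30-minute timeout is 1,800,000 ms, far below that
ceiling.
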
> > >   Then I tried stopping it (on a highly loaded system) and it
> > > timed out after just 20 seconds and got flagged as failed;
> > > 
> > > ====
> > > Jan 26 07:06:19 el8-a01n01.alteeve.ca pacemaker-controld[1846038]: notice: High CPU load detected: 3.570000
> > > Jan 26 07:06:49 el8-a01n01.alteeve.ca pacemaker-controld[1846038]: notice: High CPU load detected: 3.480000
> > > Jan 26 07:07:05 el8-a01n01.alteeve.ca pacemaker-controld[1846038]: notice: State transition S_IDLE -> S_POLICY_ENGINE
> > > Jan 26 07:07:05 el8-a01n01.alteeve.ca pacemaker-schedulerd[1846037]: notice:  * Stop       srv01-test             (               el8-a01n01 )   due to node availability
> > > Jan 26 07:07:05 el8-a01n01.alteeve.ca pacemaker-schedulerd[1846037]: notice: Calculated transition 179, saving inputs in /var/lib/pacemaker/pengine/pe-input-76.bz2
> > > Jan 26 07:07:05 el8-a01n01.alteeve.ca pacemaker-controld[1846038]: notice: Initiating stop operation srv01-test_stop_0 locally on el8-a01n01
> > > Jan 26 07:07:19 el8-a01n01.alteeve.ca pacemaker-controld[1846038]: notice: High CPU load detected: 3.850000
> > > Jan 26 07:07:25 el8-a01n01.alteeve.ca kernel: drbd srv01-test: role( Primary -> Secondary )
> > > Jan 26 07:07:25 el8-a01n01.alteeve.ca pacemaker-execd[1846035]: warning: srv01-test_stop_0 process (PID 2647133) timed out
> > > Jan 26 07:07:25 el8-a01n01.alteeve.ca pacemaker-execd[1846035]: warning: srv01-test_stop_0[2647133] timed out after 20000ms
> > > Jan 26 07:07:25 el8-a01n01.alteeve.ca pacemaker-controld[1846038]: error: Result of stop operation for srv01-test on el8-a01n01: Timed Out
> > > Jan 26 07:07:25 el8-a01n01.alteeve.ca pacemaker-controld[1846038]: notice: el8-a01n01-srv01-test_stop_0:89 [ The server: [srv01-test] is indeed running. It will be shut down now.\n ]
> > > ====
> > > 
> > > Did I not configure the stop timeout correctly?
> > > 
> > > Thanks for any insight.
> > > 
> > 
> > -- 
> > Ken Gaillot <kgaillot at redhat.com>
> > 
> > _______________________________________________
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users 
> > 
> > ClusterLabs home: https://www.clusterlabs.org/ 
> 
-- 
Ken Gaillot <kgaillot at redhat.com>