[ClusterLabs] Antw: [EXT] Re: cluster-recheck-interval and failure-timeout

Tue Apr 6 09:58:51 EDT 2021

On Tue, 2021-04-06 at 09:15 +0200, Ulrich Windl wrote:
> Ken Gaillot <kgaillot at redhat.com> wrote on 31.03.2021 at 15:48 in
> message <7dfc7c46442db17d9645854081f1269261518f84.camel at redhat.com>:
> > On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
> > > Hi.
> > > 
> > > I'm trying to understand what looks to me like incorrect behaviour
> > > between cluster-recheck-interval and failure-timeout, under
> > > pacemaker 2.0.1
> > > 
> > > I have three machines in a corosync (3.0.1 if it matters) cluster,
> > > managing 12 resources in a single group.
> > > 
> > > I'm following documentation from:
> > > 
> > > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
> > > 
> > > and
> > > 
> > > https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-resource-options.html
> > > 
> > > I have set a cluster property:
> > > 
> > > 	cluster-recheck-interval=60s
> > > 
> > > I have set a resource property:
> > > 
> > > 	failure-timeout=180
> > > 
> > > The docs say failure-timeout is "How many seconds to wait before
> > > acting as if the failure had not occurred, and potentially allowing
> > > the resource back to the node on which it failed."
> > > 
> > > I think this should mean that if the resource fails and gets
> > > restarted, the fact that it failed will be "forgotten" after 180
> > > seconds (or maybe a little longer, depending on exactly when the
> > > next cluster recheck is done).
> > > 
> > > However what I'm seeing is that if the resource fails and gets
> > > restarted, and this then happens an hour later, it's still counted
> > > as two failures.  If it 
> > 
> > That is exactly correct.
> > 
> > > fails and gets restarted another hour after that, it's recorded as
> > > three failures and (because I have "migration-threshold=3") it gets
> > > moved to another node (and therefore all the other resources in
> > > the group are moved as well).
> > > 
> > > So, what am I misunderstanding about "failure-timeout", and what
> > > configuration setting do I need to use to tell pacemaker that
> > > "provided the resource hasn't failed within the past X seconds,
> > > forget the fact that it failed more than X seconds ago"?
> > 
> > Unfortunately, there is no way. failure-timeout expires *all*
> > failures once the *most recent* is that old. It's a bit
> > counter-intuitive but currently, Pacemaker only remembers a
> > resource's most recent failure and the total count of failures, and
> > changing that would be a big project.
> 
> Hi!
> 
> Sorry, I don't get it: if you have a timestamp for each failure,
> what's so hard about putting all the failures that are older than
> failure-timeout on a list, and then resetting that list to zero?

That's exactly the issue -- we don't have a timestamp for each failure.
Only the most recent failed operation, and the total fail count (per
resource and operation), are stored in the CIB status.

We could store all failures in the CIB, but that would be a significant
project, and we'd need new options to keep the current behavior as the
default.
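
To make the difference concrete, here is a minimal sketch in Python. It
is not Pacemaker code: the class names, the `counts` helper and the
failure times are made up for illustration, and it checks expiry only
when a new failure is recorded, ignoring cluster-recheck-interval. The
first class models what is described above (one fail count plus the
time of the most recent failure); the second models the per-failure
timestamps that are not stored today.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MostRecentOnly:
    """Hypothetical model: a total fail count and one timestamp."""
    failure_timeout: int
    fail_count: int = 0
    last_failure: Optional[int] = None

    def record_failure(self, now: int) -> int:
        # All failures expire together, and only once the most
        # recent one is at least failure_timeout seconds old.
        if (self.last_failure is not None
                and now - self.last_failure >= self.failure_timeout):
            self.fail_count = 0
        self.fail_count += 1
        self.last_failure = now
        return self.fail_count


@dataclass
class PerFailureTimestamps:
    """Hypothetical alternative: every failure keeps its own timestamp."""
    failure_timeout: int
    failures: List[int] = field(default_factory=list)

    def record_failure(self, now: int) -> int:
        # Each failure expires on its own once it is old enough.
        self.failures = [t for t in self.failures
                         if now - t < self.failure_timeout]
        self.failures.append(now)
        return len(self.failures)


def counts(model, times):
    """Return the fail count seen after each failure time (seconds)."""
    return [model.record_failure(t) for t in times]


if __name__ == "__main__":
    times = [0, 120, 240]  # a failure every 120s, failure-timeout=180
    print(counts(MostRecentOnly(180), times))        # [1, 2, 3]
    print(counts(PerFailureTimestamps(180), times))  # [1, 2, 2]
```

With a failure every 120 seconds and failure-timeout=180, the first
model reaches a count of 3 (enough to trip migration-threshold=3),
because the most recent failure is never 180 seconds old. The second
model stays at 2, because the oldest failure has aged out by the time
the third one arrives.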

> I mean: That would be what everyone expects.
> What is implemented instead is like FIFO scheduling: As long as there
> is a new entry at the head of the queue, the jobs at the tail will
> never be executed.
> 
> Regards,
> Ulrich
> 
> > 
> > 
> > > Thanks,
> > > 
> > > 
> > > Antony.
> > > 
> > 
> > -- 
> > Ken Gaillot <kgaillot at redhat.com>
> > 
> > _______________________________________________
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users 
> > 
> > ClusterLabs home: https://www.clusterlabs.org/ 
> 
-- 
Ken Gaillot <kgaillot at redhat.com>