[ClusterLabs] cluster-recheck-interval and failure-timeout
Ken Gaillot
kgaillot at redhat.com
Wed Mar 31 09:48:15 EDT 2021
On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
> Hi.
>
> I'm trying to understand what looks to me like incorrect behaviour between
> cluster-recheck-interval and failure-timeout, under pacemaker 2.0.1.
>
> I have three machines in a corosync (3.0.1 if it matters) cluster,
> managing 12 resources in a single group.
>
> I'm following documentation from:
>
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-cluster-options.html
>
> and
>
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/s-resource-options.html
>
> I have set a cluster property:
>
> cluster-recheck-interval=60s
>
> I have set a resource property:
>
> failure-timeout=180
>
> The docs say failure-timeout is "How many seconds to wait before acting
> as if the failure had not occurred, and potentially allowing the resource
> back to the node on which it failed."
>
> I think this should mean that if the resource fails and gets restarted,
> the fact that it failed will be "forgotten" after 180 seconds (or maybe a
> little longer, depending on exactly when the next cluster recheck is done).
>
> However what I'm seeing is that if the resource fails and gets restarted,
> and this then happens an hour later, it's still counted as two failures.
> If it
That is exactly correct.
> fails and gets restarted another hour after that, it's recorded as three
> failures and (because I have "migration-threshold=3") it gets moved to
> another node (and therefore all the other resources in the group are moved
> as well).
>
> So, what am I misunderstanding about "failure-timeout", and what
> configuration setting do I need to use to tell pacemaker that "provided
> the resource hasn't failed within the past X seconds, forget the fact that
> it failed more than X seconds ago"?
Unfortunately, there is no way to do that. failure-timeout expires *all*
failures once the *most recent* one is that old. It's a bit
counter-intuitive, but Pacemaker currently remembers only a resource's
most recent failure and its total count of failures, and changing that
would be a big project.
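To make the rule concrete, here is a rough toy model in Python. It is
only a sketch of the behaviour described above, not Pacemaker's actual
code: the names FailureRecord and simulate are made up for illustration,
and it assumes that expiry is evaluated only at recheck ticks spaced
cluster-recheck-interval apart.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailureRecord:
    """Hypothetical model: only the total count and the time of the
    most recent failure are kept, never the individual failures."""
    fail_count: int = 0
    last_failure: Optional[float] = None  # seconds on an arbitrary clock

    def record_failure(self, now: float) -> None:
        self.fail_count += 1
        self.last_failure = now

    def expire_if_due(self, now: float, failure_timeout: float) -> None:
        # All failures are cleared together, and only once the most
        # recent one is at least failure_timeout seconds old.
        if (self.last_failure is not None
                and now - self.last_failure >= failure_timeout):
            self.fail_count = 0
            self.last_failure = None


def simulate(failure_times: List[float],
             failure_timeout: float,
             recheck_interval: float) -> List[int]:
    """Return the fail count seen right after each failure.

    Assumption of this sketch: expiry is checked only at multiples of
    recheck_interval, and a tick that coincides with a failure is
    processed before that failure.
    """
    record = FailureRecord()
    counts = []
    next_recheck = recheck_interval
    for t in sorted(failure_times):
        # Run every recheck tick up to and including this failure time.
        while next_recheck <= t:
            record.expire_if_due(next_recheck, failure_timeout)
            next_recheck += recheck_interval
        record.record_failure(t)
        counts.append(record.fail_count)
    return counts


# Failures 100 s apart with failure-timeout=180 and a 60 s recheck:
# the most recent failure is never 180 s old, so nothing expires.
print(simulate([0, 100, 200], failure_timeout=180, recheck_interval=60))
```

In this model the count reaches 3 at the third failure even though the
first failure is by then more than 180 seconds old, because there is no
per-failure timestamp to expire it on its own.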
> Thanks,
>
>
> Antony.
>
--
Ken Gaillot <kgaillot at redhat.com>