[ClusterLabs] Antw: [EXT] Re: cluster-recheck-interval and failure-timeout
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Tue Apr 6 03:20:17 EDT 2021
Hi Antony,
are you sure you are not mixing up the two:
fail-counts
sticky resource failures
I mean, once the fail-count has exceeded the threshold, you get a sticky
resource constraint (a candidate for "cleanup"). Even if the fail-count is
reset after that, the constraint will still be there.
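For example, the fail-count and any resulting ban can be inspected and
cleared with something like this (resource and node names are placeholders):

    # show the current fail-count of a resource on one node
    crm resource failcount myresource show node1

    # clear the fail-count (and thereby the ban) so the resource
    # may run on that node again
    crm resource cleanup myresource

(crm_resource --cleanup --resource myresource does the same without crmsh.)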
Regards,
Ulrich
>>> Antony Stone <Antony.Stone at ha.open.source.it> wrote on 31.03.2021 at
16:48 in message <202103311648.54643.Antony.Stone at ha.open.source.it>:
> On Wednesday 31 March 2021 at 15:48:15, Ken Gaillot wrote:
>
>> On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
>>
>> > So, what am I misunderstanding about "failure-timeout", and what
>> > configuration setting do I need to use to tell pacemaker that "provided
>> > the resource hasn't failed within the past X seconds, forget the fact
>> > that it failed more than X seconds ago"?
>>
>> Unfortunately, there is no way. failure-timeout expires *all* failures
>> once the *most recent* is that old. It's a bit counter-intuitive but
>> currently, Pacemaker only remembers a resource's most recent failure
>> and the total count of failures, and changing that would be a big
>> project.
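[For reference, failure-timeout and migration-threshold are set as resource
meta attributes, and expired failures are only cleaned up when the cluster
re-checks, i.e. at the latest after cluster-recheck-interval. A minimal
crmsh sketch with a made-up Dummy resource (names are placeholders):

    crm configure primitive p_example ocf:heartbeat:Dummy \
        op monitor interval=30s \
        meta migration-threshold=3 failure-timeout=600

    # how often the cluster re-evaluates (and expires) old failures
    crm configure property cluster-recheck-interval=5min
]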
>
> So, are you saying that if a resource failed last Friday, and then again on
> Saturday, but has been running perfectly happily ever since, a failure today
> will trigger "that's it, we're moving it, it doesn't work here"?
>
> That seems bizarre.
>
> Surely the length of time a resource has been running without a problem
> should be taken into account when deciding whether the node it's running on
> is fit to handle it or not?
>
> My problem is also bigger than that - and I can't believe there isn't a way
> round the following, otherwise people couldn't use pacemaker:
>
> I have "migration-threshold=3" on most of my resources, and I have three
> nodes.
>
> If a resource fails for the third time (in any period of time) on a node, it
> gets moved (along with the rest in the group) to another node. The cluster
> does not forget that it failed and was moved away from the first node,
> though.
>
> "crm status ‑f" confirms that to me.
>
> If it then fails three times (in an hour, or a fortnight, whatever) on the
> second node, it gets moved to node 3, and from that point on the cluster
> thinks there's nowhere else to move it to, so another failure means a total
> failure of the cluster.
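[Side note: the per-node fail-counts behind those bans can be listed with
crm_mon --failcounts, and they can also be cleared for a single node only,
e.g. (names are placeholders):

    crm resource cleanup myresource node1

which makes the resource eligible on that node again without touching the
records kept for the other nodes.]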
>
> There must be _something_ I'm doing wrong for the cluster to behave in this
> way? I can't believe it's by design.
>
>
> Regards,
>
>
> Antony.
>
> --
> Anyone that's normal doesn't really achieve much.
>
>  - Mark Blair, Australian rocket engineer
>
> Please reply to the list;
> please *don't* CC me.