[ClusterLabs] Antw: [EXT] Re: cluster-recheck-interval and failure-timeout
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Tue Apr 6 03:20:17 EDT 2021
Hi Antony,
are you sure you are not mixing up the two:
fail-counts
sticky resource failures
I mean, once the fail-count has exceeded the threshold, you get a sticky
resource constraint (a candidate for "cleanup"). Even if the fail-count is
reset after that, the constraint will still be there.
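For example, the fail-count and any resulting ban can be inspected and
cleared with something like this (resource and node names are placeholders):

    # show the current fail-count of a resource on one node
    crm resource failcount myresource show node1

    # clear the fail-count (and thereby the ban) so the resource
    # may run on that node again
    crm resource cleanup myresource

(crm_resource --cleanup --resource myresource does the same without crmsh.)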
Regards,
Ulrich
>>> Antony Stone <Antony.Stone at ha.open.source.it> wrote on 31.03.2021 at
16:48 in message <202103311648.54643.Antony.Stone at ha.open.source.it>:
> On Wednesday 31 March 2021 at 15:48:15, Ken Gaillot wrote:
>
>> On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
>>
>> > So, what am I misunderstanding about "failure-timeout", and what
>> > configuration setting do I need to use to tell pacemaker that "provided
>> > the resource hasn't failed within the past X seconds, forget the fact
>> > that it failed more than X seconds ago"?
>>
>> Unfortunately, there is no way. failure-timeout expires *all* failures
>> once the *most recent* is that old. It's a bit counter-intuitive but
>> currently, Pacemaker only remembers a resource's most recent failure
>> and the total count of failures, and changing that would be a big
>> project.
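[For reference, failure-timeout and migration-threshold are set as resource
meta attributes, and expired failures are only cleaned up when the cluster
re-checks, i.e. at the latest after cluster-recheck-interval. A minimal
crmsh sketch with a made-up Dummy resource (names are placeholders):

    crm configure primitive p_example ocf:heartbeat:Dummy \
        op monitor interval=30s \
        meta migration-threshold=3 failure-timeout=600

    # how often the cluster re-evaluates (and expires) old failures
    crm configure property cluster-recheck-interval=5min
]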
>
> So, are you saying that if a resource failed last Friday, and then again on
> Saturday, but has been running perfectly happily ever since, a failure today
> will trigger "that's it, we're moving it, it doesn't work here"?
>
> That seems bizarre.
>
> Surely the length of time a resource has been running without a problem
> should be taken into account when deciding whether the node it's running on
> is fit to handle it or not?
>
> My problem is also bigger than that - and I can't believe there isn't a way
> round the following, otherwise people couldn't use pacemaker:
>
> I have "migration-threshold=3" on most of my resources, and I have three
> nodes.
>
> If a resource fails for the third time (in any period of time) on a node, it
> gets moved (along with the rest in the group) to another node. The cluster
> does not forget that it failed and was moved away from the first node,
> though.
>
> "crm status ‑f" confirms that to me.
>
> If it then fails three times (in an hour, or a fortnight, whatever) on the
> second node, it gets moved to node 3, and from that point on the cluster
> thinks there's nowhere else to move it to, so another failure means a total
> failure of the cluster.
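[Side note: the per-node fail-counts behind those bans can be listed with
crm_mon --failcounts, and they can also be cleared for a single node only,
e.g. (names are placeholders):

    crm resource cleanup myresource node1

which makes the resource eligible on that node again without touching the
records kept for the other nodes.]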
>
> There must be _something_ I'm doing wrong for the cluster to behave in this
> way? I can't believe it's by design.
>
>
> Regards,
>
>
> Antony.
>
> --
> Anyone that's normal doesn't really achieve much.
>
>  - Mark Blair, Australian rocket engineer
>
> Please reply to the list;
> please *don't* CC me.