[ClusterLabs] Antw: [EXT] Re: cluster-recheck-interval and failure-timeout
Ulrich.Windl at rz.uni-regensburg.de
Tue Apr 6 03:20:17 EDT 2021
Are you sure you are not mixing up the two: fail-counts and sticky resource constraints?
I mean: once the fail-count has been exceeded, you get a sticky resource
constraint (a candidate for "cleanup"). Even if the fail-count is reset after
that, the constraint will still be there.
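For illustration, the fail-count and the lingering failure history can be inspected and cleared with the standard Pacemaker tools (the resource name "dummy" and node name "node1" below are placeholders, not from the original thread):

```shell
# Query the current fail-count of a resource on a given node
crm_failcount --query --resource dummy --node node1

# "Cleanup": clear the fail-count and the recorded failure history,
# which also removes the resulting ban so the resource may run on
# that node again
crm_resource --cleanup --resource dummy --node node1
```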
>>> Antony Stone <Antony.Stone at ha.open.source.it> wrote on 31.03.2021 in
message <202103311648.54643.Antony.Stone at ha.open.source.it>:
> On Wednesday 31 March 2021 at 15:48:15, Ken Gaillot wrote:
>> On Wed, 2021‑03‑31 at 14:32 +0200, Antony Stone wrote:
>> > So, what am I misunderstanding about "failure‑timeout", and what
>> > configuration setting do I need to use to tell pacemaker that "provided
>> > the resource hasn't failed within the past X seconds, forget the fact
>> > that it failed more than X seconds ago"?
>> Unfortunately, there is no way. failure‑timeout expires *all* failures
>> once the *most recent* is that old. It's a bit counter‑intuitive, but
>> currently Pacemaker only remembers a resource's most recent failure
>> and the total count of failures, and changing that would be a big change.
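As a minimal sketch of the semantics being discussed (the resource name "myip" is a placeholder): failure-timeout is a resource meta-attribute, so it can be set with crm_resource, and it expires the whole failure history once the most recent failure is older than the timeout.

```shell
# Expire all recorded failures of "myip" once its most recent
# failure is older than 10 minutes (note: this clears *all*
# failures at once, not each failure individually)
crm_resource --resource myip --meta \
    --set-parameter failure-timeout --parameter-value 10min
```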
> So, are you saying that if a resource failed last Friday, and then again on
> Saturday, but has been running perfectly happily ever since, a failure today
> will trigger "that's it, we're moving it, it doesn't work here"?
> That seems bizarre.
> Surely the length of time a resource has been running without problems should
> be taken into account when deciding whether the node it's running on is fit
> to handle it or not?
> My problem is also bigger than that ‑ and I can't believe there isn't a way
> round the following, otherwise people couldn't use pacemaker:
> I have "migration‑threshold=3" on most of my resources, and I have three
> nodes.
> If a resource fails for the third time (in any period of time) on a node, it
> gets moved (along with the rest in the group) to another node. The cluster
> does not forget that it failed and was moved away from the first node,
> "crm status ‑f" confirms that to me.
> If it then fails three times (in an hour, or a fortnight, whatever) on the
> second node, it gets moved to node 3, and from that point on the cluster
> thinks there's nowhere else to move it to, so another failure means a total
> failure of the cluster.
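One common way around this accumulation (a sketch, not something prescribed in the thread; the group name "mygroup" is a placeholder) is to clear the failure history once a problem has been dealt with, so that old failures on other nodes no longer count toward migration-threshold:

```shell
# Clear the failure history of the group on all nodes, removing
# the bans left behind by earlier migrations
crm_resource --cleanup --resource mygroup

# Verify that no fail-counts remain
crm_mon --one-shot --inactive
```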
> There must be _something_ I'm doing wrong for the cluster to behave in this
> way? I can't believe it's by design.
> Anyone that's normal doesn't really achieve much.
> ‑ Mark Blair, Australian rocket engineer