[ClusterLabs] cluster-recheck-interval and failure-timeout
Antony Stone
Antony.Stone at ha.open.source.it
Wed Mar 31 10:48:54 EDT 2021
On Wednesday 31 March 2021 at 15:48:15, Ken Gaillot wrote:
> On Wed, 2021-03-31 at 14:32 +0200, Antony Stone wrote:
>
> > So, what am I misunderstanding about "failure-timeout", and what
> > configuration setting do I need to use to tell pacemaker that "provided the
> > resource hasn't failed within the past X seconds, forget the fact that it
> > failed more than X seconds ago"?
>
> Unfortunately, there is no way. failure-timeout expires *all* failures
> once the *most recent* is that old. It's a bit counter-intuitive but
> currently, Pacemaker only remembers a resource's most recent failure
> and the total count of failures, and changing that would be a big
> project.
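
The rule quoted above can be sketched as a toy model. This is only an illustration of the description given, not Pacemaker code: the class and method names are hypothetical, times are plain numbers (hours here), and it ignores when the cluster actually re-evaluates expiry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureRecord:
    """Toy model: only the total count and the newest failure time are kept."""
    count: int = 0
    last_failure: Optional[float] = None

    def record_failure(self, now: float) -> None:
        # Each failure bumps the count and overwrites the single timestamp.
        self.count += 1
        self.last_failure = now

    def effective_count(self, now: float, failure_timeout: float) -> int:
        # All failures expire together, and only once the *most recent*
        # one is at least failure_timeout old.
        if self.last_failure is None:
            return 0
        if now - self.last_failure >= failure_timeout:
            return 0
        return self.count


rec = FailureRecord()
rec.record_failure(0.0)    # first failure at hour 0
rec.record_failure(24.0)   # second failure at hour 24

# With a 48-hour timeout, the first failure is still counted at hour 60,
# even though it is 60 hours old, because the newest is only 36 hours old.
print(rec.effective_count(60.0, 48.0))  # 2
# At hour 72 the newest failure is 48 hours old, so everything expires.
print(rec.effective_count(72.0, 48.0))  # 0
```

In this model an old failure never expires on its own; it is carried along for as long as newer failures keep arriving within the timeout.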

So, are you saying that if a resource failed last Friday, and then again on
Saturday, but has been running perfectly happily ever since, a failure today
will trigger "that's it, we're moving it, it doesn't work here"?

That seems bizarre.

Surely the length of time a resource has been running without problem should
be taken into account when deciding whether the node it's running on is fit to
handle it or not?

My problem is also bigger than that - and I can't believe there isn't a way
round the following, otherwise people couldn't use pacemaker:

I have "migration-threshold=3" on most of my resources, and I have three
nodes.

If a resource fails for the third time (in any period of time) on a node, it
gets moved (along with the rest in the group) to another node. The cluster
does not forget that it failed and was moved away from the first node, though.
"crm status -f" confirms that to me.

If it then fails three times (in an hour, or a fortnight, whatever) on the
second node, it gets moved to node 3, and from that point on the cluster
thinks there's nowhere else to move it to, so another failure means a total
failure of the cluster.
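
The sequence described above can be sketched as a small simulation. It is a toy model of the behaviour as reported here, not of Pacemaker's real placement logic (which also weighs scores, stickiness and constraints). The function names and node names are hypothetical, and it assumes fail counts are never cleared and that a node stops being eligible once its count reaches the threshold.

```python
from typing import Dict, List, Optional


def place(fail_counts: Dict[str, int], nodes: List[str],
          threshold: int) -> Optional[str]:
    """Return the first node whose fail count is still below the threshold."""
    for node in nodes:
        if fail_counts.get(node, 0) < threshold:
            return node
    return None  # no eligible node is left


def simulate(nodes: List[str], threshold: int,
             failures: int) -> List[Optional[str]]:
    """Fail the resource repeatedly and record where it runs after each failure."""
    fail_counts = {node: 0 for node in nodes}
    current = place(fail_counts, nodes, threshold)
    history = [current]
    for _ in range(failures):
        if current is None:
            break  # resource is already stopped everywhere
        fail_counts[current] += 1  # counts only ever grow in this model
        current = place(fail_counts, nodes, threshold)
        history.append(current)
    return history


# Three nodes, migration-threshold of 3, nine failures in total.
for step, node in enumerate(simulate(["node1", "node2", "node3"], 3, 9)):
    print(step, node)
```

Because nothing in this model ever resets a count, the elapsed time between failures makes no difference: the ninth failure leaves the resource with nowhere to run whether the failures span an hour or a year.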

There must be _something_ I'm doing wrong for the cluster to behave in this
way? I can't believe it's by design.

Regards,
Antony.
--
Anyone that's normal doesn't really achieve much.
- Mark Blair, Australian rocket engineer
Please reply to the list;
please *don't* CC me.